Machine learning overview (with SAS software)


MACHINE LEARNING WITH SAS WORKSHOP
GETTING THE MOST OUT OF YOUR DATA

Longhow Lam

AGENDA AND SOME READING MATERIAL

Intro & positioning of machine learning
SAS platform for machine learning
Overview of specific methods
Some examples

Further reading:

An experimental comparison of classification techniques for imbalanced credit scoring data sets using SAS® Enterprise Miner: http://support.sas.com/resources/papers/proceedings12/129-2012.pdf

Benchmarking state-of-the-art classification algorithms for credit scoring: a ten-year update: http://www.business-school.ed.ac.uk/waf/crc_archive/2013/42.pdf

An absolute recommendation for more detail: The Elements of Statistical Learning, Hastie, Tibshirani & Friedman, http://www-stat.stanford.edu/~tibs/ElemStatLearn/

LONGHOW LAM: SHORT BIO

MSc Mathematics (1995), Vrije Universiteit Amsterdam (drs. wiskunde)
MTD Applied Statistics (1997), Technical University Delft (two-year postgraduate programme in applied statistics)

10+ years SAS experience (Base, Stat, Guide, Miner, VA, VS)
10+ years R experience (An Introduction to R)
10+ years predictive modeling experience:
ABN AMRO - risk modeler (Basel, credit risk, ALM models)
Business & Decision - quantitative consultant (ING Belgium, Fortis, Leaseplan, Belgium Post)
Experian - data miner (Collection Score, Delphi credit score, consulting)

Follow me: longhowlam

INTRO MACHINE LEARNING

Wikipedia: "Machine learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data. Such algorithms operate by building a model based on inputs and using that to make predictions or decisions, rather than following only explicitly programmed instructions."

MACHINE LEARNING AND SOME OTHER TERMS YOU OFTEN HEAR

Statistical modeling
Supervised learning
Clustering
Unsupervised learning
Data mining
Machine learning
Dimension reduction
Association rules
Recommender
Autoencoders
Self-organizing maps

SAS SOFTWARE FOR MACHINE LEARNING (AND DATA MINING)

THE ANALYTICS LIFECYCLE

IDENTIFY / FORMULATE PROBLEM → DATA PREPARATION → DATA EXPLORATION → TRANSFORM & SELECT → BUILD MODEL → VALIDATE MODEL → DEPLOY MODEL → EVALUATE / MONITOR RESULTS

Tooling per role:
BUSINESS MANAGER: SAS In-Database Scoring, SAS Decision Manager
IT / SYSTEMS MANAGEMENT: SAS Model Manager
BUSINESS ANALYST: SAS Enterprise Guide
DATA MINER / DATA SCIENTIST: Enterprise Miner, Text Miner, SAS IMSTAT, Recommender
SAS Visual Analytics / SAS Visual Statistics

EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES

proc hpbnet data = creditdata structure = markovblanket;
   model default = x1 LTV income age;
   selection = Y;
run;

MACHINE LEARNING: HIGH PERFORMANCE

Machine learning algorithms designed to run on single-blade or multi-blade distributed-memory environments.

EASY DEPLOYABLE: MACHINE LEARNING WITH SAS

Manage rules + data + models
Deployment flexibility: batch, real time, stored process, in-database
Drive reuse and consistency

PREDICT SOMEONE'S INCOME: IS THIS MACHINE LEARNING?

Predict someone's income from his/her age:
Collect some data (an analytical base table)
Plot the data
Fit a line: Income = 152 + 1102 × Age

MACHINE LEARNING: ADDRESSING SOME MODELING ISSUES

The problem may not be linear: X², X³, log(X), sqrt(X), 1/X, …
You do not have one input variable: X1, X2, X3, …, X567
Interactions and correlations between input variables

[Figure: analytical base table with derived inputs such as age, income, male, female]

MACHINE LEARNING: WHY IT CAN MATTER € € €

Suppose we have an untargeted direct mailing of 100,000 'letters' to randomly sampled prospects:
Conversion rate is around 1%
Profit per conversion: €80
Cost per mailing: €0.70
Total ROI = 100,000 × 1% × €80 − 100,000 × €0.70 = €10,000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data and can distinguish between high / low responders.

MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N       Conversion   Profit    Cumulative
1         10,000      2.00%       9,000       9,000
2         10,000      1.50%       5,000      14,000
3         10,000      1.00%       1,000      15,000
4         10,000      1.00%       1,000      16,000
5         10,000      1.00%       1,000      17,000
6         10,000      1.00%       1,000      18,000
7         10,000      1.00%       1,000      19,000
8         10,000      0.80%        -600      18,400
9         10,000      0.50%      -3,000      15,400
10        10,000      0.20%      -5,400      10,000

The profit by using a model to send letters only to the first 7 deciles is now €19,000 (instead of €10,000). If you have 100 such campaigns a year, that means an increase of €0.9 mln.

MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N       Conversion   Profit    Cumulative
1         10,000      3.00%      17,000      17,000
2         10,000      2.00%       9,000      26,000
3         10,000      1.40%       4,200      30,200
4         10,000      1.15%       2,200      32,400
5         10,000      1.00%       1,000      33,400
6         10,000      0.60%      -2,200      31,200
7         10,000      0.40%      -3,800      27,400
8         10,000      0.30%      -4,600      22,800
9         10,000      0.10%      -6,200      16,600
10        10,000      0.05%      -6,600      10,000

The profit by using a much better model to send letters only to the first 5 deciles is now €33,400 (instead of €10,000). If you have 100 such campaigns a year, that means an increase of €2.34 mln.

MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N       Conversion   Profit    Cumulative
1         10,000      3.35%      19,800      19,800
2         10,000      2.23%      10,840      30,640
3         10,000      1.30%       3,400      34,040
4         10,000      1.10%       1,800      35,840
5         10,000      1.00%       1,000      36,840
6         10,000      0.55%      -2,600      34,240
7         10,000      0.28%      -4,760      29,480
8         10,000      0.25%      -5,000      24,480
9         10,000      0.05%      -6,600      17,880
10        10,000      0.02%      -6,840      11,040

Now let's suppose we have an even slightly better model than the last one: €36,840 (instead of €10,000). If you have 100 such campaigns a year, that means an increase of €2.68 mln.

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression
Decision trees
Dimension reduction
Bagging & boosting
Support vector machines
K-nearest neighbour
Neural networks / deep learning
Bayesian networks
Text mining
Recommendation engines

"CLASSICAL" REGRESSION

LINEAR & LOGISTIC REGRESSION

Numeric target variable: Income = a + b × Age

Binary target variable: P(Churn) = 1 / (1 + exp(−(a + b × Age)))

[Figures: straight-line fit of income vs. age; S-shaped logistic curve of P(Churn) vs. age, between 0 and 1]
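As a minimal sketch, this is how the two models above are typically fit in SAS (the data set and variable names are hypothetical):

   /* numeric target: linear regression */
   proc reg data=customers;
      model income = age;
   run;

   /* binary target: logistic regression */
   proc logistic data=customers;
      model churn(event='1') = age;
   run;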

SPLINE REGRESSION: MODELING NON-LINEARITIES

Often there is a non-linear relation:
• Transformation of inputs: X², X³, log(X), etc.
• Buckets / binning of variables
• Smoothing splines

[Figure: Y / logit(Y) plotted against X]

SPLINE REGRESSION: MODELING NON-LINEARITIES

Smoothing splines: piecewise polynomials that are glued together at knots.

Two special cases for λ:
λ = 0: any function that interpolates the data
λ = ∞: simple least-squares line fit

Choose λ by cross-validation.
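The criterion behind this (the standard formulation, as in The Elements of Statistical Learning) trades off fit against curvature, with λ the smoothing parameter; λ = 0 and λ = ∞ give the two special cases above:

$$\hat f=\operatorname*{arg\,min}_f\;\sum_{i=1}^{n}\big(y_i-f(x_i)\big)^2+\lambda\int f''(t)^2\,dt$$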

OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION

Extracted data from a car sales site. For many cars we have the kilometres driven and the car price. For the Opel Astra we have 2,360 cars: what is the relation between km driven and car sales price?

[Figure: too much smoothing and too little smoothing]

OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION

0.2 is the optimal smoothing parameter.

Some other car makes/models with spline estimates of car depreciation versus kilometres driven.

Hmmm, my Renault Clio looks nice, but after 50,000 km I only have 46% of the original value left…

SPLINE REGRESSION: MODELING NON-LINEARITIES IN SAS

In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines.

ADAPTIVEREG supports more than one input; linear, logistic, Poisson and GLM regressions; combines both regression splines and model selection methods; and supports partitioning of data into training, validation and testing roles.
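A minimal sketch (the data set astra and its variables are hypothetical):

   proc adaptivereg data=astra;
      model price = km;   /* adaptive regression spline of price on km */
   run;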

DECISION TREES


DECISION TREES

How does it work? A simple example. Suppose we have the following group of people: 50% response, 50% no response. We have/know age and marital status.

50/50
├─ Age ≤ 45 → 30/70
│   ├─ Married/Divorced → 20/80
│   └─ Unmarried → 60/40
└─ Age > 45 → 60/40

DECISION TREES: REGRESSION & CLASSIFICATION

Target   X1   X2   X3    X4   X5
Y        12   A    456   12   X
N        21   B    456   15   X
Y        32   A    545   13   U
Y        34   C    443   11   U
N        23   A    345   17   U
N        13   B    567   12   X
N        45   A    654   19   X
…        …    …    …     …    …
Y        46   A    657   21   X

A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split on
4. On the two new data sets, apply 1, 2, 3 again…
5. Stop somewhere…

• How to split? X1 or X2?
• When to stop?
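A sketch of growing such a tree in SAS with PROC HPSPLIT (data set and variables are hypothetical):

   proc hpsplit data=train maxdepth=5;
      class target x2 x5;              /* categorical variables */
      model target = x1 x2 x3 x4 x5;   /* the procedure loops through the
                                          inputs and picks the best splits */
   run;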

DECISION TREES: REGRESSION & CLASSIFICATION

How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.

Why is the split X1 < t1 better than X1 < s1?
• Regression: mean squared error
• Classification: misclassification rate, cross-entropy, chi-squared

[Figure: regression tree, mean squared error for split s1 vs. split t1]

DECISION TREES: REGRESSION & CLASSIFICATION

[Figure: classification tree, misclassification rate for split s1 vs. split t1]

DECISION TREES (REGRESSION & CLASSIFICATION)

When to stop? Not too early, not too late.

Pruning: remove parts of the tree.

DECISION TREES: SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)
C4.5 / C5.0
CART (Classification And Regression Trees)

The difference is mainly in the different splitting options.

DECISION TREES: PROS AND CONS

Pros: interaction between variables; interpretable rules; missing values easy to incorporate.

Cons: unstable; "lack of smoothness"; fit of obvious (non)linear relations.

[Figures: response-rate tree with splits male/female, income < 45K, age < 33; smooth spline fit on the Opel Astras data]

DIMENSION REDUCTION


PRINCIPAL COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data.

The transformation W is such that: the largest variance is in the first coordinate; the second-largest variance is in the second coordinate; etc.

PRINCIPAL COMPONENTS ANALYSIS

[Figure: data points in the (X1, X2) plane with the principal component directions P1 and P2, and the same data replotted in (P1, P2) coordinates]

PRINCIPAL COMPONENTS ANALYSIS: THE MATH BEHIND IT

With two dimensions, P = XW:

$$\begin{bmatrix} p_{11} & p_{21}\\ \vdots & \vdots\\ p_{1n} & p_{2n}\end{bmatrix}=\begin{bmatrix} x_{11} & x_{21}\\ \vdots & \vdots\\ x_{1n} & x_{2n}\end{bmatrix}\begin{bmatrix} w_{11} & w_{21}\\ w_{12} & w_{22}\end{bmatrix}$$

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.

In general it turns out that the columns of W are the eigenvectors of the matrix XᵀX.

PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here.

Applications of PCA:
Dimension reduction
Visualisation
Outlier / anomaly detection
PCA regression: use the principal components instead of the original inputs
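A minimal sketch with PROC PRINCOMP (hypothetical ABT with inputs x1-x100):

   /* PROC PRINCOMP works on the correlation matrix by default, */
   /* i.e. the inputs are standardized: scaling matters here    */
   proc princomp data=abt out=scores n=2;
      var x1-x100;   /* the scores data set gets the first 2 PCs */
   run;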

PRINCIPAL COMPONENTS: DIMENSION REDUCTION

P = XW. Now take only the first L columns of W: PL = X·WL.

For example, for visualization use only the first 2 or 3 columns, so that PL has only 2 or 3 columns that can be visualized in scatter or contour plots.

P  = X·W:   (10000 × 100) = (10000 × 100)(100 × 100)
PL = X·WL:  (10000 × 2)   = (10000 × 100)(100 × 2)

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U·Σ·Vᵀ, with Σ diagonal with r singular values [r could be a large number].

SINGULAR VALUE DECOMPOSITION

Take only k << r singular values: Ak = Uk·Σk·Vkᵀ.

A data point d can now be represented by a k-dimensional point.
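In SAS/IML the truncated reconstruction can be sketched as follows (the matrix a is assumed to already hold the data):

   proc iml;
      call svd(u, q, v, a);    /* a = u*diag(q)*v` with q the singular values */
      k = 15;                  /* keep only k << r singular values            */
      ak = u[, 1:k] * diag(q[1:k]) * t(v[, 1:k]);   /* rank-k approximation   */
   quit;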

SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

Original image: 2448 × 3264 ≈ 8 mln numbers
SVD with the 15 largest SVs: ≈ 1% of the data
SVD with the 75 largest SVs: ≈ 5% of the data

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling.

X1, X2, X3, …, X500:
Cluster: X1, X21, X35, X430, … → use X35
Cluster: X17, X29, X353, X490, … → use X29
Cluster: X37, X95, X251, X393, … → use X251
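In SAS this is what PROC VARCLUS does (a sketch; the number of clusters is an assumption):

   proc varclus data=abt maxclusters=10 short;
      var x1-x500;   /* prints cluster membership and R-square statistics */
   run;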

BAGGING & BOOSTING

COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough, let multiple models vote for a prediction: Bootstrap Aggregation (Bagging). Draw random samples of the data, fit a model on each sample, and combine them into a final model.

This only makes sense if the underlying models are different enough and have some predictive power.

BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees.

Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Choose randomly m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
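A sketch with the high-performance random forest procedure of SAS Enterprise Miner (data set, variables and option values are illustrative):

   proc hpforest data=train maxtrees=500 vars_to_try=10;   /* m = 10 << P */
      target default / level=binary;
      input x1-x100  / level=interval;
   run;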

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

Decision tree and random forest (100 sub-trees) fitted on the simulated data. It is clear to see that the forest can produce much smoother predictions.

GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting: M iterations, m = 1, 2, …, M.

Start with inputs x and pseudo-residuals r1. At each successive iteration a base learner hm (which is a decision tree) is fit on the pseudo-residuals rm, using the inputs x, to "correct" the previous learner:

Fm = Fm−1 + γm·hm

After M iterations the final model is FM.
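In SAS Enterprise Miner gradient boosting is available through, for instance, PROC TREEBOOST; a sketch (data set, variables and option values are illustrative):

   proc treeboost data=train iterations=100 shrinkage=0.1 maxdepth=3;
      input x1-x100 / level=interval;
      target default / level=binary;   /* each iteration fits a small tree
                                          on the pseudo-residuals */
   run;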

SUPPORT VECTOR MACHINES


Support vector machines (SVM): suppose we have a separable classification problem.

Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, i.e. x², x³ or spline(x).

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".

SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

Separable classification
Non-separable classification
Non-separable classification rewritten using the Lagrange dual problem
Kernels to model non-linear behaviour
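The formulas on this slide are images in the deck; for reference, the standard formulations (following The Elements of Statistical Learning) are:

Separable classification:
$$\max_{\beta,\beta_0,\;\lVert\beta\rVert=1} M \quad\text{subject to}\quad y_i\big(x_i^T\beta+\beta_0\big)\ge M,\; i=1,\dots,N$$

Non-separable classification (slack variables ξ, total penalty bounded by C):
$$y_i\big(x_i^T\beta+\beta_0\big)\ge M(1-\xi_i),\qquad \xi_i\ge 0,\qquad \sum_i \xi_i\le C$$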

https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.

K-NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

Example: of the 5 nearest neighbours of x0, 3 are red and 2 are green, so we predict x0 to be red.

K-NN METHOD

[Figure: decision boundaries for 1 nearest neighbour vs. 15 nearest neighbours]

Use different numbers k of nearest neighbours and compare test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
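In SAS a k-NN classifier can be run with PROC DISCRIM in its non-parametric mode (a sketch; the data sets are hypothetical):

   proc discrim data=train test=score testout=pred method=npar k=5;
      class target;   /* majority vote among the 5 nearest neighbours */
      var x1-x4;
   run;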

K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price? For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.

Comparing different numbers of nearest neighbours in SAS Enterprise Miner: 30% of the data was used as validation set. In Enterprise Miner different values for k were used; k = 5 nearest neighbours has the lowest average squared error.

NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = w1·1 + w2·X2 + w3·X3 + w4·X4

A neural network compute node: f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w that have to be determined.

[Figure: network with bias node 1 and inputs X2, X3, X4, weights w1-w4 feeding node f]

NEURAL NETWORKS: MATHEMATICAL FORMULATION

[Figure: network with inputs X1 = age, X2 = income, X3 = region, X4 = gender, a hidden layer z1, z2, z3 with weights α, and output Y with weights β]

X: inputs; z: hidden layer; Y: output. The functions g and σ are the output and hidden activation functions, with a special choice of g in case of a binary classifier. The model weights α and β have to be estimated from the data.
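The prediction formula itself is an image in the deck; the standard single-hidden-layer form (as in The Elements of Statistical Learning) is:

$$z_m=\sigma\big(\alpha_{0m}+\alpha_m^T X\big),\qquad Y=g\big(\beta_0+\beta^T z\big)$$

with σ(v) = 1/(1 + e^(−v)) the sigmoid; in case of a binary classifier g is typically the sigmoid as well, so that Y can be read as a probability.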

NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back-propagation algorithm: randomly choose small values for all weights w, then for each data point (observation):

1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w in the direction that decreases E (gradient descent: w ← w − γ·∂E/∂w)
4. Stop if the error E is small enough

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS: AUTOENCODERS

Often more hidden layers with many nodes: INPUT → ENCODE → DECODE → OUTPUT = INPUT.

NEURAL NET: CARS EXAMPLE

2-dimensional PCA vs. an autoencoder network 25 - 15 - 2 - 15 - 25.

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data    = autoencoderTraining
   dmdbcat = work.autoencoderTrainingCat;
   performance compile details cpucount = 12 threads = yes;

   /* DEFAULTS: ACT = TANH, COMBINE = LINEAR          */
   /* IDs are used as layer indicators - see figure 6 */
   /* inputs and targets should be standardized       */
   archi MLP hidden = 5;
   hidden 300 / id = h1;
   hidden 100 / id = h2;
   hidden 2   / id = h3 act = linear;
   hidden 100 / id = h4;
   hidden 300 / id = h5;
   input corruptedPixel1 - corruptedPixel400 / id = i level = int std = std;
   target pixel1 - pixel400 / act = identity id = t level = int std = std;

   /* before preliminary training the weights will be random */
   initial random = 123;
   prelim 10 preiter = 10;
run;

Two-dimensional representation of the 400-dimensional 'digit' data.

BAYESIAN NETWORKS


BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data

TEXT MINING


TEXT MINING BASICS

"Advanced" word counting:

Parse & filter: part of speech, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Apply traditional data mining: clustering, prediction, machine learning.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" (I walk along the street in Amsterdam, 1057DK, with my bike)
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw" (She did not walk but cycled with her blue bike, <link>)
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" (My two-wheeler is broken, what a bad piece of iron $$)

TERM-DOCUMENT MATRIX A:

Terms                        Doc 1   Doc 2   Doc 3
+Fiets (noun)                  1       1       1
Fietsen (verb)                 0       1       0
Blauwe (adjective)             0       1       0
Amsterdam (location)           1       0       0
+Lopen (verb)                  1       1       0
Straat (noun)                  1       0       0
Kapot (adverb)                 0       0       1
Slecht                         0       0       1
Stuk ijzer                     0       0       1
1057DK (postal code)           1       0       0
bitlycomsdrtw (internet)       0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A

TEXT MINING: TERM-DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term-document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply a singular value decomposition first.

TEXT MINING: SVD ON THE TERM-DOCUMENT MATRIX A

Matrix SVD decomposition: A = U·Σ·Vᵀ, with Σ diagonal with r singular values [could be many thousands]. Take only the first k << r singular values: Ak = Uk·Σk·Vkᵀ.

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn / fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.

RECOMMENDATION ENGINE: WHICH PRODUCT SHOULD I RECOMMEND TO MY CUSTOMERS?

RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items → ~0.01% filled.

User-item matrix (data):
          Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings: -, -, 1, 2, 5. After some math… the predicted ratings for User 4 are 3.21, 4.82, 1, 2, 5 → recommend item 2 (the highest prediction among the unrated items).

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1), k nearest neighbours (knn)
Model-based algorithms: matrix factorization (SVD - LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble

RE METHODS: SLOPE ONE

Item-item based: y = x + b, with slope equal to 1.

Weight wij: the number of users having rated both items i and j.
Rating ruj: the average rating computed from item j.

Sample rating database:
Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4        -
Lucy          -        2        5

RE METHODS: K NEAREST NEIGHBOURS

The rating rui is determined by the ratings "in the neighbourhood".

How to determine the neighbours N, and how many (k) to use?

How to compute the similarity / distance measure w?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

RE METHODS: PEARSON CORRELATION

r(a,p): rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between −1 and 1:

$$sim(a,b)=\frac{\sum_{p\in P}\big(r_{a,p}-\bar r_a\big)\big(r_{b,p}-\bar r_b\big)}{\sqrt{\sum_{p\in P}\big(r_{a,p}-\bar r_a\big)^2}\;\sqrt{\sum_{p\in P}\big(r_{b,p}-\bar r_b\big)^2}}$$

RE METHODS: K NEAREST NEIGHBOURS METHOD

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data? Factor the m × n rating matrix R (users × items) as R ≈ U·V, with U an m × k matrix and V a k × n matrix:

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: solve with L-BFGS or ALS

Predict a new rating: R̂ij = Uiᵀ·Vj

Minimize the prediction error:

$$\min_{U,V}\sum_{i,j}\big(R_{ij}-U_i^T V_j\big)^2+\lambda\big(\lVert U_i\rVert^2+\lVert V_j\rVert^2\big)$$

RE METHODS: CLUSTER

First apply clustering on the user/item profiles and user/item ratings; then apply knn within one subgroup to generate the predictions.

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:
IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X,Y) = #transactions with X and Y / total #transactions
Lift = Support(X,Y) / (Support(X) · Support(Y))

Support & lift examples: Diapers → Beer: 0.8; Diapers → Candles: 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.

RE METHOD: ENSEMBLE

Linear combination of the previous methods, to achieve better performance.

proc recommend recom = rs.IENS;
   /* add a recommendation system */
   add rs.IENS item = item user = user rating = rating;

   /* add tables */
   addtable LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* method SVD LBFGS with 20 factors */
   method svd /
      factors   = 20
      label     = "svd"
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      maxfeval  = 5000
      function  = L2
      lamda     = 0.2
      technique = lbfgs;
   run;

   method arm /
      label = "ARM";
   run;

   /* information on the recommender system */
   info;
quit;

/* prediction with the SVD method */
proc recommend recom = rs.IENS;
   predict /
      method = svd
      label  = "svd"
      num    = 3
      users  = ("Longhow Lam");
run;
quit;

LAST SLIDE: PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
Unfamiliar to a broader audience, (more) difficult to explain
Black-box approach (you are rejected: the computer says NO)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem

PROS:
Often less data prep (manual tuning) necessary (just throw it into the algorithm…)
Interactions often "automatically" taken into account
Superior for text mining, image & speech recognition
Better lift possible (a few percent "for free")
It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces. So, can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used Text Miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5
R² neural net = 0.6

IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

[Figure: first 100 digits of the MNIST data and their KNOWN labels in red]

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split; models tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours

8-nearest neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA: APPLY MODEL ON THE TEST SET

28,000 digits without known labels; our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we obviously see some mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH AN IPHONE

[Audio players: spoken digits '1' and '2']

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer. Zero errors on the training data; zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
Which faces are look-alikes? → proc cluster (hierarchical clustering)
Sales faces? → predictive modeling / machine learning
Who is the Brad Pitt? → nearest neighbour
Strange faces? → proc neural / autoencoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-ALIKES

STRANGE FACE DETECTION: STRANGE FACES

[Figures: SAS faces and actors' faces]

Read more on my blog.

Page 2: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

AGENDA AND SOME READING MATERIAL

Intro amp positioning of Machine learning SAS platform for Machine learning Overview of Specific methods Some examples

Further reading

An experimental comparison of classification techniques for imbalanced credit scoring data sets using SASreg Enterprise Minerhttpsupportsascomresourcespapersproceedings12129-2012pdf

Benchmarking state-of-the-art classification algorithms for credit scoring A ten-year updatehttpwwwbusiness-schooledacukwafcrc_archive201342pdf

An absolute recommender for more detail The elements of statistical learning Hasting Tibshirani amp Friedman httpwww-statstanfordedu~tibsElemStatLearn

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LONGHOW LAM SHORT BIO

MSc Mathematics (1995) Vrije Universiteit Amsterdam (drs wiskunde) MTD Applied Statistics (1997) Technical University Delft (twee jarige AIO toegepaste statistiek)

10+ year SAS experience (Base Stat Guide Miner VA VS) 10+ year R experience ( An introduction to R)

10 + year predictive modeling experience ABNAMRO ndash Risk modeler

Basel Credit risk ALM models BusinessampDecision ndash Quantitative consultant

ING Belgium Fortis Leaseplan Belgium Post

Experian ndash data mininer Collection Score Delphi credit score consulting

longhowlamFollow me

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

INTRO MACHINE LEARNING

WikipedialdquoMachine learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data Such algorithms operate by building a model based on inputs and using that to make predictions or decisions rather than following only explicitly programmed instructionsrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING AND SOME OTHER TERMS YOU OFTEN HEAR

Statisticalmodeling

SupervisedLearning

Clustering

UnsupervisedLearning

Data mining

Machine learning

Dimensionreduction

Association rules

Recommender

Autoencoders

Self organizing

maps

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SAS SOFTWAREFOR MACHINE LEARNING (AND DATA MINING)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IDENTIFY FORMULATE

PROBLEM

DATAPREPARATION

DATAEXPLORATION

TRANSFORMamp SELECT

BUILDMODEL

VALIDATEMODEL

DEPLOYMODEL

EVALUATE MONITORRESULTS

SAS In-Database ScoringSAS Decision Manager

BUSINESSMANAGER

SAS Model Manager

IT SYSTEMS MANAGEMENT

SAS Enterprise Guide

BUSINESSANALYST

Enterprise Miner Text MinerSAS IMSTAT Recommender

DATA MINER DATA SCIENTIST

THE ANALYTICS LIFECYCLE

SAS Visual AnalyticsSAS Visual Statistics

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES

PROC hpbnet data = creditdata structure = markovblanket model default = x1 LTV income age selction = YRUN

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING

Machine Learning algorithms designed to run on single blade or multi blade distributed memory environments

HIGH PERFORMANCE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Manage Rules + Data + Models

Deployment flexibility BatchReal TimeStored ProcessIn Database

Drive Reuse and Consistency

EASY DEPLOYABLE

Model

Data

Rules

Model

MACHINE LEARNING WITH SAS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICT SOMEONErsquoS INCOME

Income = 152 + 1102 times Age

Age

Income

Predict someones income from hisher age

Collect some data

Plot the data

Analytical Base Table

IS THIS MACHINE LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING ADDRESSING SOME MODELING ISSUES

The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip

You do not have one input variable X1 X2 X3helliphellipX567

Interactions en correlations between input variables

age

income

male

female

Analytical base table Derived inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects

Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 200 9000 9000

2 10000 150 5000 14000

3 10000 100 1000 15000

4 10000 100 1000 16000

5 10000 100 1000 17000

6 10000 100 1000 18000

7 10000 100 1000 19000

8 10000 080 -600 18400

9 10000 050 -3000 15400

10 10000 020 -5400 10000

The profit by using a model to sent letters only to the first 7 deciles is now

euro 19000 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 09 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 300 17000 17000

2 10000 200 9000 26000

3 10000 140 4200 30200

4 10000 115 2200 32400

5 10000 100 1000 33400

6 10000 060 -2200 31200

7 10000 040 -3800 27400

8 10000 030 -4600 22800

9 10000 010 -6200 16600

10 10000 005 -6600 10000

The profit by using a much better model to sent letters only to the first 5 deciles is now

euro 33400 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 234 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 335 19800 19800

2 10000 223 10840 30640

3 10000 130 3400 34040

4 10000 110 1800 35840

5 10000 100 1000 36840

6 10000 055 -2600 34240

7 10000 028 -4760 29480

8 10000 025 -5000 24480

9 10000 005 -6600 17880

10 10000 002 -6840 11040

Now lets suppose we have even a slightly better model than the last one

euro 36840

If you have 100 of such campaigns a year that means an increase of

euro 268 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines

K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

ldquoCLASSICALrdquo REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees.

Apply the underlying steps repeatedly:
1. Generate a bootstrap sample.
2. Choose randomly m inputs, m << P.
3. Fit a tree on the bootstrap sample with the m inputs (do not prune).

In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
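A SAS sketch of these steps with PROC HPFOREST (data set and variable names illustrative; vars_to_try plays the role of m):

proc hpforest data=work.train maxtrees=100 vars_to_try=5;
   target default / level=binary;    /* classification: majority vote */
   input x1-x25 / level=interval;
run;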

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

Decision tree and random forest (100 sub trees) fitted on the simulated data.

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

It is easy to see that the forest produces much smoother predictions.

GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting: M iterations, m = 1, 2, …, M.

[diagram: inputs x with pseudo residuals r1, r2, …, rM at each step, leading to the final model FM]

At each successive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals, using inputs x, to "correct" the previous learner. Pseudo residuals rim at each step.

Fm = Fm-1 + γ·hm
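For reference, the standard formulas behind this schematic (the deck's formula slide was an image; squared-error fitting of the base learner shown as the simplest case):

$$r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}}, \qquad h_m = \arg\min_h \sum_i \big(r_{im} - h(x_i)\big)^2, \qquad F_m = F_{m-1} + \gamma_m h_m$$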

SUPPORT VECTOR MACHINES

Support vector machines (SVM). Suppose we have a separable classification problem.

Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".


SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

- Separable classification
- Non separable classification
- Non separable classification rewritten using the Lagrange dual problem
- Kernels to model nonlinear behaviour
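The formula images did not survive the transcript; for reference, the standard formulations these four bullets refer to (following Hastie et al.) are:

$$\min_{\beta,\beta_0}\ \tfrac{1}{2}\lVert\beta\rVert^2 \quad \text{s.t. } y_i(x_i^T\beta+\beta_0)\ge 1$$

$$\min_{\beta,\beta_0}\ \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_i \xi_i \quad \text{s.t. } y_i(x_i^T\beta+\beta_0)\ge 1-\xi_i,\ \xi_i\ge 0$$

$$\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t. } 0\le\alpha_i\le C,\ \sum_i \alpha_i y_i = 0$$

Kernel trick: replace $x_i^T x_j$ by a kernel $K(x_i, x_j)$, e.g. $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$.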

https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.

K - NEAREST NEIGHBOUR

K-NN METHOD

- No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
- Classify x0 using the majority vote among the k neighbours.

[figure: the 5 nearest neighbours of x0; 3 of them are red, 2 of them are green, so we predict x0 to be red]
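A SAS sketch of k-NN classification (data set and variable names illustrative):

proc discrim data=work.train test=work.query testout=work.pred
             method=npar k=5;
   class colour;      /* red / green */
   var x1 x2;
run;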

K-NN METHOD

[figure: decision boundaries for 1 nearest neighbour vs 15 nearest neighbours]

K-NN METHOD

Use different numbers k of nearest neighbours; compare test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
- handwritten digits
- satellite image scenes
- EKG patterns

K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.

Comparing different nearest neighbours in SAS Enterprise Miner.

K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were used; k = 5 nearest neighbours has the lowest average squared error.

NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4

[diagram: inputs 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding one neural network compute node]

f is the so-called activation function. This could be the logit function, but other choices are possible.

There are four weights w's that have to be determined.

NEURAL NETWORKS: MATHEMATICAL FORMULATION

In formula form, the prediction formula for a NN is given below.

[diagram: inputs X1 = age (leeftijd), X2 = income (inkomen), X3 = region (regio), X4 = gender (geslacht), feeding a hidden layer Z1, Z2, Z3 and outputs Y / N, with weights α1, … and β1, …]

The functions g and σ are the output and hidden-layer activation functions; in case of a binary classifier g is typically the logistic function.

The model weights α and β have to be estimated from the data.
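The formula images were lost in the transcript; the standard single-hidden-layer form they refer to (following Hastie et al.) is:

$$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1,\dots,3, \qquad Y = g(\beta_0 + \beta^T Z)$$

$$\sigma(v) = \frac{1}{1+e^{-v}}, \qquad g(t) = \frac{1}{1+e^{-t}} \ \text{in case of a binary classifier}$$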

NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back propagation algorithm:

Randomly choose small values for all wi's. For each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual - prediction)²).
3. Adjust the weights w according to the gradient descent update w_new = w_old - η·∂E/∂w.
4. Stop if the error E is small enough.

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use inputs to predict the inputs.

[diagram: inputs X1…X4 → ENCODE → DECODE → outputs X1…X4]

A linear activation function corresponds with 2 dimensional principal components analysis.

A 2 dimensional middle layer can be used for visualisation.

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes.

[diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]

NEURAL NET: CARS EXAMPLE

[figures: 2 dimensional PCA vs autoencoder network 25 - 15 - 2 - 15 - 25]

NEURAL NETS: AUTOENCODER EXAMPLE

- 1000 images of digits
- Each image has 400 pixels
- So a 400 dimensional input vector X = (x1, …, x400)
- Compare two dimensional PCA with a neural net autoencoder

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR          */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED       */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

Two dimensional representation of the 400 dimensional 'digit' data.

BAYESIAN NETWORKS

BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS

- Nodes represent random variables.
- Links between nodes represent conditional dependencies.
- Conditional probability tables are derived from training data for each node.
- Random variables are typically binary or discrete.
- The graph structure can be learned from the data.


TEXT MINING

TEXT MINING BASICS

"Advanced" word counting.

Parse & filter: part of speech, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Then apply traditional data mining: clustering, prediction, machine learning.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

TERM DOCUMENT MATRIX A

Terms                     Doc 1  Doc 2  Doc 3
+Fiets (znmw)               1      1      1
Fietsen (ww)                0      1      0
Blauwe (bvg)                0      1      0
Amsterdam (locatie)         1      0      0
+Lopen (ww)                 1      1      0
Straat (znmw)               1      0      0
Kapot (bijw)                0      0      1
Slecht                      0      0      1
Stuk Ijzer                  0      0      1
1057DK (postcode)           1      0      0
bitlycomsdrtw (Internet)    0      1      0

- Each text document is a (very) long vector of word counts (often with many zeros).
- Apply further mining on this matrix A.

TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:
- Often more terms than documents
- Rows could be strongly correlated
- The matrix is often very sparse

Apply singular value decomposition first.

TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [could be many thousands].

Take only the first k << r singular values: A_k = U_k Σ_k Vᵀ_k

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[figure: documents grouped into Topic 1, Topic 2, Topic 3]

RECOMMENDATION ENGINE: WHICH PRODUCT SHOULD I RECOMMEND MY CUSTOMERS?

RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items, ~ 0.01% filled.

User - Item Matrix - Data

           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings:           -     -     1  2  5
After some math… predictions:  3.21  4.82   1  2  5

Recommend item 2.

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

- Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
- Model-based algorithms: matrix factorization (SVD - LBFGS)
- Market basket analysis: association rules mining (arm)
- Mixture of different methods: clustering (cluster), ensemble

RE METHODS: SLOPE ONE

Item-item based: y = x + b, a regression with slope equal to 1.

Weight wij: the number of users having rated both items i and j. Rating r̄j: the average rating computed for item j.

Sample rating database:

Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        2       -       5
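A worked sketch on this table (weighted slope one; the arithmetic is mine, not from the slide). Predict Lucy's rating for item B:

dev(B,A) = ((3-5) + (4-3)) / 2 = -0.5, weight 2 (John and Mark rated both A and B)
dev(B,C) = (3-2) / 1 = 1, weight 1 (only John rated both B and C)

prediction = (2·(2 - 0.5) + 1·(5 + 1)) / (2 + 1) = (3 + 6) / 3 = 3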

RE METHODS: K NEAREST NEIGHBORS

The rating rui is determined by the ratings "in the neighborhood" (similarity w, neighbours N).

How to determine the neighbors, and how many (k) to use?

How to compute the similarity / distance measure?
- Pearson's correlation coefficient
- Cosine distance
- Other adjustments

RE METHODS: PEARSON CORRELATION

a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between -1 and 1.

$$sim(a,b) = \frac{\sum_{p\in P}\,(r_{a,p}-\bar r_a)(r_{b,p}-\bar r_b)}{\sqrt{\sum_{p\in P}(r_{a,p}-\bar r_a)^2}\;\sqrt{\sum_{p\in P}(r_{b,p}-\bar r_b)^2}}$$


RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

R ≈ U V, where R is m × n (users × items), U is m × k and V is k × n.

- Select a loss function (squared error)
- Select the number of hidden factors k
- Optimization problem: L-BFGS, ALS

Predict a new rating and minimize the prediction error:

$$\hat R_{ij} = U_i^T V_j, \qquad \min_{U,V}\ \sum_{i,j}\big(R_{ij}-U_i^T V_j\big)^2 + \lambda\big(\lVert U_i\rVert^2 + \lVert V_j\rVert^2\big)$$

RE METHODS: CLUSTER

[diagram: user/item profiles and user/item ratings → clustering → k-NN within one subgroup → predictions]

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X,Y) = (# trxs with X and Y) / (total # trxs)
Lift = Support(X,Y) / (Support(X) · Support(Y))

Support & lift examples: Diapers → Beer: 0.8; Diapers → Candles: 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.

METHOD: ENSEMBLE

Linear combination of the previous methods, to achieve better performance.

PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      maxfeval = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      num = 3
      users = ("Longhow Lam");
RUN;
QUIT;

LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
- Unfamiliar to a broader audience, (more) difficult to explain
- Black box approach (you are rejected: the computer says NO)
- Often relations can already be modeled with classical regression models
- It allows you to not think about the business problem

PROS:
- Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
- Interactions often "automatically" taken into account
- Superior for text mining, image & speech recognition
- Better lift possible (a few percent "for free")
- It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING

- Many different techniques
- Easy to use GUI's combined with flexible coding
- High performance scalability
- Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

[figure: predicted review score vs given review score]

R² linear regression = 0.5
R² neural net = 0.6


IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS
(MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784 dimensional vector.

[figure: the first 100 digits of the MNIST data and their KNOWN labels in red]

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split. Models tried:
- PCA regression on the 50 largest PC's
- Seven single layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
- Seven multi layer neural nets
- Three random forests: 100, 500 and 1000 trees
- 8, 16 and 24 nearest neighbours

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA: APPLY MODEL ON THE TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we see some obvious mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[audio players: recordings of the spoken digits "1" and "2"]

SPEECH RECOGNITION

WAV files consist of ~ 30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
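A sketch of the spectral analysis step in SAS/ETS (data set and variable names illustrative):

proc spectra data=work.one1 out=work.spec1 p;
   var amplitude;   /* periodogram of the recorded signal */
run;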


SPEECH RECOGNITION

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to:
- Retrieve the SAS faces from our site
- Put them through the Face++ API
- Collect the JSON results and store them in an ABT

Apply advanced analytics on the ABT:
- Which faces are look-alikes? proc cluster (hierarchical clustering)
- Sales faces? Predictive modeling / machine learning
- Who is the Brad Pitt? Nearest neighbour
- Strange faces? proc neural / autoencoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

SAS faces vs actors' faces.

Read more on my blog.


MACHINE LEARNING: HIGH PERFORMANCE

Machine learning algorithms designed to run on single blade or multi blade distributed memory environments.

MACHINE LEARNING WITH SAS: EASY DEPLOYABLE

- Manage rules + data + models
- Deployment flexibility: batch, real time, stored process, in-database
- Drive reuse and consistency

[diagram: model, data and rules combined at deployment]

PREDICT SOMEONE'S INCOME

Predict someone's income from his/her age:
- Collect some data (analytical base table)
- Plot the data
- Fit a line: Income = 152 + 1102 × Age

IS THIS MACHINE LEARNING?

MACHINE LEARNING: ADDRESSING SOME MODELING ISSUES

- The problem may not be linear: X², X³, log(X), sqrt(X), 1/X, …
- You do not have one input variable: X1, X2, X3, …, X567
- Interactions and correlations between input variables

[figure: analytical base table with derived inputs: age, income, male, female]

MACHINE LEARNING: WHY IT CAN MATTER € € €

Suppose we have an untargeted direct mailing of 100,000 'letters' to randomly sampled prospects:
- Conversion rate is around 1%
- Profit per conversion: €80
- Cost per mailing: €0.70
- Total ROI = 100,000 × 1% × €80 - 100,000 × €0.70 = €10,000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data and can distinguish between high / low responders.

MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N   Conversion   Profit   Cumulative
1       10000      2.00%       9000         9000
2       10000      1.50%       5000        14000
3       10000      1.00%       1000        15000
4       10000      1.00%       1000        16000
5       10000      1.00%       1000        17000
6       10000      1.00%       1000        18000
7       10000      1.00%       1000        19000
8       10000      0.80%       -600        18400
9       10000      0.50%      -3000        15400
10      10000      0.20%      -5400        10000

The profit by using a model to send letters only to the first 7 deciles is now €19,000 (instead of €10,000).

If you have 100 of such campaigns a year, that means an increase of €0.9 mln.

MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N   Conversion   Profit   Cumulative
1       10000      3.00%      17000        17000
2       10000      2.00%       9000        26000
3       10000      1.40%       4200        30200
4       10000      1.15%       2200        32400
5       10000      1.00%       1000        33400
6       10000      0.60%      -2200        31200
7       10000      0.40%      -3800        27400
8       10000      0.30%      -4600        22800
9       10000      0.10%      -6200        16600
10      10000      0.05%      -6600        10000

The profit by using a much better model to send letters only to the first 5 deciles is now €33,400 (instead of €10,000).

If you have 100 of such campaigns a year, that means an increase of €2.34 mln.

MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N   Conversion   Profit   Cumulative
1       10000      3.35%      19800        19800
2       10000      2.23%      10840        30640
3       10000      1.30%       3400        34040
4       10000      1.10%       1800        35840
5       10000      1.00%       1000        36840
6       10000      0.55%      -2600        34240
7       10000      0.28%      -4760        29480
8       10000      0.25%      -5000        24480
9       10000      0.05%      -6600        17880
10      10000      0.02%      -6840        11040

Now let's suppose we have an even slightly better model than the last one: €36,840.

If you have 100 of such campaigns a year, that means an increase of €2.68 mln.

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

- Classical regression
- Decision trees
- Dimension reduction
- Bagging & boosting
- Support vector machines
- K-nearest neighbour
- Neural networks / deep learning
- Bayesian networks
- Text mining
- Recommendation engine

"CLASSICAL" REGRESSION

LINEAR & LOGISTIC REGRESSION

Numeric target variable: Income = a + b × Age

Binary target variable: P(Churn) = 1 / (1 + exp(-(a + b × Age)))

[figures: straight-line fit of income vs age; S-shaped curve of P(Churn) between 0 and 1 vs age]

SPLINE REGRESSION: MODELING NON LINEARITIES

Often there is a non linear relation:
- Transformation of inputs: X², X³, log(X), etc…
- Buckets / binning of variables
- Smoothing splines

[figure: Y, or logit(y), plotted against X]

SPLINE REGRESSION: MODELING NON LINEARITIES

Smoothing splines: piecewise polynomials that are glued together at knots.

Two special cases for λ:
- λ = 0: any function that interpolates the data
- λ = ∞: simple least squares line fit

Choose λ by cross validation.
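The criterion behind this (the formula image was lost in the transcript; this is the standard penalized residual sum of squares):

$$\min_f\ \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda \int f''(t)^2 \, dt$$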

OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION

Extracted data from a car sales site. For many cars we have the kilometres driven and the car price. For the Opel Astra we have 2360 cars: what is the relation between km driven and car sales price?

[figures: too much smoothing and too little smoothing]

OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION

0.2 is the optimal smoothing parameter.

Some other car makes/models with spline estimates of car depreciation versus kilometres driven.

Hmmm, my Renault Clio looks nice, but after 50,000 km I only have 46% of the original value left…

SPLINE REGRESSION: MODELING NON LINEARITIES

In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines.

ADAPTIVEREG supports more than one input; linear, logistic, Poisson and GLM regressions; combines both regression splines and model selection methods; and supports partitioning of the data into training, validation and testing roles.
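A minimal sketch with PROC TPSPLINE (data set and variable names illustrative, matching the Opel Astra example above):

proc tpspline data=work.astra;
   model price = (km);   /* thin-plate smoothing spline of price vs km driven */
run;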

DECISION TREES

DECISION TREES

How does it work? A simple example. Suppose we have the following group of people: 50% response, 50% no response. We have/know age and marital status.

[tree: root 50/50; Age ≤ 45 → 30/70, Age > 45 → 60/40; the 30/70 node splits on marital status: Married/Divorced → 20/80, Unmarried → 60/40]

DECISION TREES: REGRESSION & CLASSIFICATION

Target  X1  X2  X3   X4  X5
Y       12  A   456  12  X
N       21  B   456  15  X
Y       32  A   545  13  U
Y       34  C   443  11  U
N       23  A   345  17  U
N       13  B   567  12  X
N       45  A   654  19  X
…       …   …   …    …   …
Y       46  A   657  21  X

A recursive splitting algorithm:
1. Loop through all inputs.
2. Determine per input how to split.
3. Take the best input to split.
4. On the two new data sets, apply 1, 2, 3 again…
5. Stop somewhere…

- How to split: X1 or X2?
- When to stop?

DECISION TREES: REGRESSION & CLASSIFICATION

How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.

Why is split X1 < t1 better than X1 < s1?
- Regression: mean squared error
- Classification: misclassification rate, cross-entropy, chi-squared

[figure: regression tree, mean square error for split s1 vs split t1]

[figure: classification tree, misclassification rate for split s1 vs split t1]

DECISION TREES (REGRESSION & CLASSIFICATION)

When to stop? Not too early, not too late.

Pruning: remove parts of the tree.

DECISION TREES: SOME COMMON TYPES

- CHAID (chi-squared automatic interaction detection)
- C4.5, C5.0
- CART (classification and regression trees)

The difference is mainly in the different splitting options.
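A SAS sketch of fitting and pruning such a classification tree (data set and variable names illustrative):

proc hpsplit data=work.mail seed=42;
   class response marital;
   model response = age marital income;
   prune costcomplexity;    /* cost-complexity pruning: not too early, not too late */
run;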

DECISION TREES: PROS AND CONS

Pros:
- Interaction between variables
- Interpretable rules
- Missing values easy to incorporate

Cons:
- Unstable
- "Lack of smoothness"
- Fit of obvious (non)linear relations

[example trees: man/woman, income < 45K, age < 33 with response rates; the Opel Astras data]

DIMENSION REDUCTION

PRINCIPAL COMPONENTS ANALYSIS

Linear transformation of the data to uncorrelated data. The transformation W is such that:
- The largest variance is in the first coordinate
- The second largest variance is in the second coordinate
- Etc…

[figure: scatter of points in (X1, X2) with the principal directions P1 and P2]

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.


Predicted review score vs. given review score

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

R² linear regression = 0.5
R² neural net = 0.6


IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.


MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

Tried on a 70/30 training/validation split:
• PCA regression on the 50 largest PCs
• Seven single layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors
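
In scikit-learn terms (the deck itself uses SAS Enterprise Miner), the nearest-neighbour part of this benchmark could look like the sketch below, assuming X holds the 42,000 × 784 pixel matrix and y the known labels:

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # 70/30 training/validation split, as on the slide
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=1)
    for k in (8, 16, 24):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        print(k, 1 - knn.score(X_val, y_val))   # misclassification rate per k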


MNIST DATA: APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are the predicted labels. We obviously see some mistakes…


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

(embedded audio clips of the spoken digits "1" and "2")


SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain.

Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
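
A sketch of this feature-extraction pipeline in Python, assuming mono recordings and hypothetical file names:

    import numpy as np
    from scipy.io import wavfile
    from sklearn.decomposition import PCA

    def spectrum(path, n=8192):
        _, signal = wavfile.read(path)          # ~30,000 samples per recording
        return np.abs(np.fft.rfft(signal, n))   # spectral analysis: magnitude spectrum

    files = [f"one_{i}.wav" for i in range(8)] + [f"two_{i}.wav" for i in range(8)]
    S = np.array([spectrum(f) for f in files])        # 16 spectra
    features = PCA(n_components=5).fit_transform(S)   # still too much: reduce with PCA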


SPEECH RECOGNITION


SPEECH RECOGNITION

Zero errors on the training data.

Zero errors on the test data (also 8 'ones' and 8 'twos').

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Little joke on my colleagueshellip


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• Retrieve the SAS faces from our site, put them through the Face++ API
• Collect the JSON results and store them in an ABT

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder
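
For flavour, a Python sketch of the collection step (the deck uses an R script); the endpoint, parameters and response fields below are placeholders, not the actual Face++ API:

    import csv
    import requests

    API_URL = "https://api.example.com/detect"     # hypothetical landmark endpoint
    face_urls = ["https://example.com/face1.jpg"]  # hypothetical face image URLs

    rows = []
    for url in face_urls:
        resp = requests.post(API_URL, data={"api_key": "YOUR_KEY", "image_url": url})
        landmarks = resp.json()["landmarks"]       # 83 facial landmarks (placeholder field)
        rows.append([url] + [landmarks[k] for k in sorted(landmarks)])

    with open("faces_abt.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)              # analytical base table (ABT)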


STRANGE FACE DETECTION: LOOK-ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION: STRANGE FACES

(figure panels: SAS faces, actors' faces)

Read more on my blog


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

(figure panels: SAS faces, actors' faces)

Read more on my blog

Page 4: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

INTRO MACHINE LEARNING

WikipedialdquoMachine learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data Such algorithms operate by building a model based on inputs and using that to make predictions or decisions rather than following only explicitly programmed instructionsrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING AND SOME OTHER TERMS YOU OFTEN HEAR

Statisticalmodeling

SupervisedLearning

Clustering

UnsupervisedLearning

Data mining

Machine learning

Dimensionreduction

Association rules

Recommender

Autoencoders

Self organizing

maps

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SAS SOFTWAREFOR MACHINE LEARNING (AND DATA MINING)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IDENTIFY FORMULATE

PROBLEM

DATAPREPARATION

DATAEXPLORATION

TRANSFORMamp SELECT

BUILDMODEL

VALIDATEMODEL

DEPLOYMODEL

EVALUATE MONITORRESULTS

SAS In-Database ScoringSAS Decision Manager

BUSINESSMANAGER

SAS Model Manager

IT SYSTEMS MANAGEMENT

SAS Enterprise Guide

BUSINESSANALYST

Enterprise Miner Text MinerSAS IMSTAT Recommender

DATA MINER DATA SCIENTIST

THE ANALYTICS LIFECYCLE

SAS Visual AnalyticsSAS Visual Statistics

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES

PROC hpbnet data = creditdata structure = markovblanket model default = x1 LTV income age selction = YRUN

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING

Machine Learning algorithms designed to run on single blade or multi blade distributed memory environments

HIGH PERFORMANCE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Manage Rules + Data + Models

Deployment flexibility BatchReal TimeStored ProcessIn Database

Drive Reuse and Consistency

EASY DEPLOYABLE

Model

Data

Rules

Model

MACHINE LEARNING WITH SAS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICT SOMEONErsquoS INCOME

Income = 152 + 1102 times Age

Age

Income

Predict someones income from hisher age

Collect some data

Plot the data

Analytical Base Table

IS THIS MACHINE LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING ADDRESSING SOME MODELING ISSUES

The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip

You do not have one input variable X1 X2 X3helliphellipX567

Interactions en correlations between input variables

age

income

male

female

Analytical base table Derived inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects

Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 200 9000 9000

2 10000 150 5000 14000

3 10000 100 1000 15000

4 10000 100 1000 16000

5 10000 100 1000 17000

6 10000 100 1000 18000

7 10000 100 1000 19000

8 10000 080 -600 18400

9 10000 050 -3000 15400

10 10000 020 -5400 10000

The profit by using a model to sent letters only to the first 7 deciles is now

euro 19000 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 09 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 300 17000 17000

2 10000 200 9000 26000

3 10000 140 4200 30200

4 10000 115 2200 32400

5 10000 100 1000 33400

6 10000 060 -2200 31200

7 10000 040 -3800 27400

8 10000 030 -4600 22800

9 10000 010 -6200 16600

10 10000 005 -6600 10000

The profit by using a much better model to sent letters only to the first 5 deciles is now

euro 33400 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 234 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 335 19800 19800

2 10000 223 10840 30640

3 10000 130 3400 34040

4 10000 110 1800 35840

5 10000 100 1000 36840

6 10000 055 -2600 34240

7 10000 028 -4760 29480

8 10000 025 -5000 24480

9 10000 005 -6600 17880

10 10000 002 -6840 11040

Now lets suppose we have even a slightly better model than the last one

euro 36840

If you have 100 of such campaigns a year that means an increase of

euro 268 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines

K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

ldquoCLASSICALrdquo REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 5: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING AND SOME OTHER TERMS YOU OFTEN HEAR

Statisticalmodeling

SupervisedLearning

Clustering

UnsupervisedLearning

Data mining

Machine learning

Dimensionreduction

Association rules

Recommender

Autoencoders

Self organizing

maps

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SAS SOFTWAREFOR MACHINE LEARNING (AND DATA MINING)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IDENTIFY FORMULATE

PROBLEM

DATAPREPARATION

DATAEXPLORATION

TRANSFORMamp SELECT

BUILDMODEL

VALIDATEMODEL

DEPLOYMODEL

EVALUATE MONITORRESULTS

SAS In-Database ScoringSAS Decision Manager

BUSINESSMANAGER

SAS Model Manager

IT SYSTEMS MANAGEMENT

SAS Enterprise Guide

BUSINESSANALYST

Enterprise Miner Text MinerSAS IMSTAT Recommender

DATA MINER DATA SCIENTIST

THE ANALYTICS LIFECYCLE

SAS Visual AnalyticsSAS Visual Statistics

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES

PROC hpbnet data = creditdata structure = markovblanket model default = x1 LTV income age selction = YRUN

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING

Machine Learning algorithms designed to run on single blade or multi blade distributed memory environments

HIGH PERFORMANCE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Manage Rules + Data + Models

Deployment flexibility BatchReal TimeStored ProcessIn Database

Drive Reuse and Consistency

EASY DEPLOYABLE

Model

Data

Rules

Model

MACHINE LEARNING WITH SAS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICT SOMEONErsquoS INCOME

Income = 152 + 1102 times Age

Age

Income

Predict someones income from hisher age

Collect some data

Plot the data

Analytical Base Table

IS THIS MACHINE LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING ADDRESSING SOME MODELING ISSUES

The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip

You do not have one input variable X1 X2 X3helliphellipX567

Interactions en correlations between input variables

age

income

male

female

Analytical base table Derived inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects

Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 200 9000 9000

2 10000 150 5000 14000

3 10000 100 1000 15000

4 10000 100 1000 16000

5 10000 100 1000 17000

6 10000 100 1000 18000

7 10000 100 1000 19000

8 10000 080 -600 18400

9 10000 050 -3000 15400

10 10000 020 -5400 10000

The profit by using a model to sent letters only to the first 7 deciles is now

euro 19000 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 09 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 300 17000 17000

2 10000 200 9000 26000

3 10000 140 4200 30200

4 10000 115 2200 32400

5 10000 100 1000 33400

6 10000 060 -2200 31200

7 10000 040 -3800 27400

8 10000 030 -4600 22800

9 10000 010 -6200 16600

10 10000 005 -6600 10000

The profit by using a much better model to sent letters only to the first 5 deciles is now

euro 33400 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 234 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 335 19800 19800

2 10000 223 10840 30640

3 10000 130 3400 34040

4 10000 110 1800 35840

5 10000 100 1000 36840

6 10000 055 -2600 34240

7 10000 028 -4760 29480

8 10000 025 -5000 24480

9 10000 005 -6600 17880

10 10000 002 -6840 11040

Now lets suppose we have even a slightly better model than the last one

euro 36840

If you have 100 of such campaigns a year that means an increase of

euro 268 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines

K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

ldquoCLASSICALrdquo REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES: SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C4.5, C5.0

CART (Classification And Regression Trees)

The differences are mainly in the splitting options.


Decision trees: pros and cons

Pros: interactions between variables; interpretable rules; missing values are easy to incorporate.

Cons: unstable; "lack of smoothness"; poor fit of obvious (non)linear relations.

(Figures: an example tree splitting on male/female, income < 45K and age < 33 with response rates; and the Opel Astra price curve as an obvious non-linear relation.)


DIMENSION REDUCTION


PRINCIPAL COMPONENTS ANALYSIS

A linear transformation of the data to uncorrelated data. The transformation W is such that: the largest variance is in the first coordinate; the second largest variance is in the second coordinate; etc.


PRINCIPAL COMPONENTS ANALYSIS

(Figure: a scatter of points in the (X1, X2) plane, with the two principal directions P1 and P2 drawn through the cloud.)


PRINCIPAL COMPONENTS ANALYSIS

(Figure: the same cloud of points plotted in the (P1, P2) coordinates.)


PRINCIPAL COMPONENTS ANALYSIS

The math behind it, with two dimensions: P = X W, i.e.

$$\begin{bmatrix} p_{11} & p_{21}\\ \vdots & \vdots\\ p_{1n} & p_{2n}\end{bmatrix} = \begin{bmatrix} x_{11} & x_{21}\\ \vdots & \vdots\\ x_{1n} & x_{2n}\end{bmatrix} \begin{bmatrix} w_{11} & w_{21}\\ w_{12} & w_{22}\end{bmatrix}$$

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.

In general it turns out that the columns of W are the eigenvectors of the matrix XᵀX.


PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here.

Applications of PCA: dimension reduction; visualisation; outlier / anomaly detection; PCA regression (use the PCs instead of the original inputs).
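A minimal sketch of these applications with PROC PRINCOMP, assuming a dataset wide with inputs x1-x100 (our names):

proc princomp data=wide out=scores n=2 std;
   var x1-x100;        /* correlation matrix by default, i.e. scaled inputs */
run;

/* SCORES now holds Prin1 and Prin2: plot them for visualisation or
   outlier detection, or use them as inputs in a PCA regression. */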


PRINCIPAL COMPONENTS: DIMENSION REDUCTION

P = X W. Now take only the first L columns of W: P_L = X W_L.

For example, for visualization only use the first 2 or 3 columns, so that P_L only has 2 or 3 columns that can be visualized in scatter or contour plots.

Dimensions: P = X W is (10000 × 100) = (10000 × 100)(100 × 100); P_L = X W_L is (10000 × 2) = (10000 × 100)(100 × 2).


SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be a large number].


SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be a large number]. Take only k ≪ r singular values: A ≈ U_k Σ_k V_kᵀ.

A data point d can now be represented by a k-dimensional point.
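In SAS/IML a truncated SVD can be sketched as follows (dataset and matrix names are assumptions):

proc iml;
   use pixels;  read all var _NUM_ into A;  close pixels;
   call svd(U, Q, V, A);                      /* A = U*diag(Q)*V` */
   k  = 15;                                   /* keep the k largest singular values */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;  /* rank-k approximation of A */
quit;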


SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original: 2448 × 3264 ≈ 8 mln numbers.


SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD with the 15 largest SVs: 1% of the data.


SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD with the 75 largest SVs: 5% of the data.


VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated. Then use only one input per cluster for predictive modeling.

X1, X2, X3, …, X500

Cluster {X1, X21, X35, X430, …} → use X35
Cluster {X17, X29, X353, X490, …} → use X29
Cluster {X37, X95, X251, X393, …} → use X251
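A minimal sketch with PROC VARCLUS, assuming a dataset wide with inputs x1-x500 (our names):

proc varclus data=wide maxclusters=10 short;
   var x1-x500;
run;

/* Then keep one representative per cluster, for example the variable
   with the lowest 1-R**2 ratio in the cluster summary. */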



BAGGING & BOOSTING


COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough, let multiple models vote for a prediction: bootstrap aggregation (bagging).

This only makes sense if the underlying models are different enough and have some predictive power.

(Figure: random samples drawn from the data, one model fitted per sample, combined into a final model.)


BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees. Apply the underlying steps repeatedly:

1. Generate a bootstrap sample.
2. Randomly choose m inputs, m ≪ P.
3. Fit a tree on the bootstrap sample with the m inputs (do not prune).

In case of a classification tree: the random forest prediction is the majority vote of all trees. In case of a regression tree: the random forest prediction is the average of all trees.
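These steps are what PROC HPFOREST automates. A minimal sketch, assuming a dataset credit with target default (our names):

proc hpforest data=credit maxtrees=100 vars_to_try=3;   /* m = 3 inputs per split */
   target default / level=binary;
   input age income ltv / level=interval;
run;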


FOREST VS TREE EXAMPLE ON SIMULATED DATA

A decision tree and a random forest (100 subtrees) fitted on the simulated data.


FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear that the forest produces much smoother predictions.


GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU


GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting: M iterations, m = 1, 2, …, M.

At each successive iteration a base learner h_m (a decision tree) is fit on the pseudo-residuals r_m, using the inputs x, to "correct" the previous learner:

F_m = F_{m-1} + γ·h_m

(Figure: pseudo-residuals r_1, r_2, …, r_M and inputs x feeding successive base learners, ending in the final model F_M.)
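A minimal sketch of this scheme with PROC TREEBOOST; the dataset and variables are assumptions, and SHRINKAGE= plays the role of the step size on each h_m:

proc treeboost data=credit iterations=200 shrinkage=0.1 maxbranch=2;
   input age income ltv / level=interval;
   target default / level=binary;
run;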


SUPPORT VECTOR MACHINES


Support vector machines (SVM). Suppose we have a separable classification problem: find a linear decision boundary between the two groups with maximum margin M; so the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
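A hedged sketch with PROC HPSVM (SAS Enterprise Miner); the dataset, variables and kernel settings are assumptions:

proc hpsvm data=train c=1.0;          /* C bounds the total penalty */
   input x1 x2 / level=interval;
   target y;
   kernel rbf / k_par=1;              /* the kernel trick: radial basis kernel */
run;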



SVM: THE UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

Separable classification; non-separable classification; non-separable classification rewritten using the Lagrange dual problem; kernels to model non-linear behaviour.


https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.


K-NEAREST NEIGHBOUR


K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

(Figure: the 5 nearest neighbours of x0; 3 of them are red, 2 of them are green, so we predict x0 to be red.)
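Classical SAS also offers k-NN classification through PROC DISCRIM with METHOD=NPAR. A minimal sketch (dataset and variable names are assumptions):

proc discrim data=train test=query testout=scored method=npar k=5;
   class colour;       /* red / green */
   var x1 x2;
run;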


K-NN METHOD

(Figure: decision boundaries for 1 nearest neighbour versus 15 nearest neighbours.)


K-NN METHOD

Use different numbers k of nearest neighbours and compare test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns


K-NN EXAMPLE DUTCH HOUSE PRICES

Extracted house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.


Comparing different nearest neighbours in SAS Enterprise Miner


K-NN EXAMPLE DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were used; k=5 nearest neighbours has the lowest average squared error.


NEURAL NETWORKS / DEEP LEARNING


NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = w1·1 + w2·X2 + w3·X3 + w4·X4

(Figure: a single compute node with inputs 1, X2, X3, X4 and weights w1, w2, w3, w4.)

f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w that have to be determined.


NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula form, the prediction of a NN with one hidden layer is given by the network below: inputs X1 (age), X2 (income), X3 (region), X4 (gender), hidden-layer nodes Z1, Z2, Z3 and output Y, with weights α (inputs → hidden layer) and β (hidden layer → output):

Z_m = σ(α_{0m} + α_mᵀ X) and Y = g(β_0 + βᵀ Z)

The functions g and σ are activation functions; in case of a binary classifier g is, for example, the logistic function. The model weights α and β have to be estimated from the data.


NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back-propagation algorithm:

Randomly choose small values for all weights w_i. For each data point (observation):

1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w in the direction that decreases E (gradient descent): w ← w − η·∂E/∂w.
4. Stop if the error E is small enough.
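A minimal sketch of such a network with PROC NEURAL, mirroring the autoencoder code later in this deck; the dataset, catalog and variable names are assumptions:

proc dmdb batch data=train out=trainDb dmdbcat=work.trainCat;
   var age income;
   class region gender response;
run;

proc neural data=train dmdbcat=work.trainCat;
   archi MLP hidden=1;                         /* one hidden layer */
   hidden 3 / id=h;                            /* three hidden nodes Z1..Z3 */
   input age income / id=i level=interval std=std;
   input region gender / id=c level=nominal;
   target response / id=t level=nominal;       /* binary classifier */
   prelim 5;                                   /* preliminary runs for start weights */
   train;
run;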


DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS


NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

(Figure: inputs X1…X4 encoded into a smaller middle layer, then decoded back to X1…X4.)

A linear activation function corresponds with 2-dimensional principal components analysis. A 2-dimensional middle layer can be used for visualisation.


NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes. (Figure: INPUT → encode → decode → OUTPUT = INPUT.)


NEURAL NET CARS EXAMPLE

2-dimensional PCA versus an autoencoder network 25 – 15 – 2 – 15 – 25.


NEURAL NETS AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder


NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR       */
   /* IDS ARE USED AS LAYER INDICATORS           */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED  */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;


Two-dimensional representation of the 400-dimensional 'digit' data.


BAYESIAN NETWORKS


BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from training data for each node.
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.


TEXT MINING


TEXT MINING BASICS

"Advanced" word counting.

Parse & filter: part of speech; entity detection; mixed / numeric / abbreviations; stemming; spell checks; stop list; synonym list; multi-term words.

Then apply traditional data mining: clustering, prediction, machine learning.


TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

(Three Dutch example documents about bicycles.)

Terms                     | Doc 1 | Doc 2 | Doc 3
+Fiets (znmw)             |   1   |   1   |   1
Fietsen (ww)              |   0   |   1   |   0
Blauwe (bvg)              |   0   |   1   |   0
Amsterdam (locatie)       |   1   |   0   |   0
+Lopen (ww)               |   1   |   1   |   0
Straat (znmw)             |   1   |   0   |   0
Kapot (bijw)              |   0   |   0   |   1
Slecht                    |   0   |   0   |   1
Stuk Ijzer                |   0   |   0   |   1
1057DK (postcode)         |   1   |   0   |   0
bitlycomsdrtw (Internet)  |   0   |   1   |   0

TERM-DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros).
• Apply further mining on this matrix A.


TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term-document matrix:

• often more terms than documents;
• rows could be strongly correlated;
• the matrix is often very sparse.

Apply singular value decomposition first.


TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be many thousands]. Take only the first k ≪ r singular values: A ≈ U_k Σ_k V_kᵀ.

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.


TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics). (Figure: documents grouped under Topic 1, Topic 2, Topic 3.)


RECOMMENDATION ENGINE: Which product should I recommend to my customers?


RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives ~0.01% filled.

User-item matrix data:

         Item 1  Item 2  Item 3  Item 4  Item 5
User 1     3       2       5       4       5
User 2     -       -       -       1       1
User 3     1       -       2       5       -
User 4     -       -       1       2       5
User 5     2       1       4       2       3
User 6     2       3       -       5       1
User 7     5       1       -       3       4
User 8     -       1       -       4       1
User 9     2       3       2       4       2
User 10    -       1       3       -       1

User 4's item ratings: -, -, 1, 2, 5. After some math… the predicted ratings for User 4 are: 3.21, 4.82, 1, 2, 5. Recommend item 2.


RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1); k nearest neighbours (knn).

Model-based algorithms: matrix factorization (SVD - LBFGS).

Market basket analysis: association rules mining (arm).

Mixture of different methods: clustering (cluster); ensemble.


RE METHODS: SLOPE ONE

Item-item based: y = x + b, with the slope equal to 1.

Weight w_ij: the number of users having rated both items i and j. Rating r_uj: the average rating computed from item j.

Sample rating database:

Customer | Item A | Item B | Item C
John     |   5    |   3    |   2
Mark     |   3    |   4    |   -
Lucy     |   -    |   2    |   5


RE METHODS K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood": a similarity-weighted average over the neighbours N.

How to determine the neighbours, and how many (k) to use?

How to compute the similarity / distance measure w?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments


RE METHODS: PEARSON CORRELATION

a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between -1 and 1.

$$sim(a,b)=\frac{\sum_{p\in P}(r_{a,p}-\bar r_a)(r_{b,p}-\bar r_b)}{\sqrt{\sum_{p\in P}(r_{a,p}-\bar r_a)^2}\,\sqrt{\sum_{p\in P}(r_{b,p}-\bar r_b)^2}}$$


RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data? Factor the m × n rating matrix R (users × items) as R ≈ U V, with U of size m × k and V of size k × n.

Select a loss function (squared error); select the number of hidden factors k; solve the optimization problem with L-BFGS or ALS.

Predict a new rating and minimize the prediction error:

$$\hat R_{ij}=U_i^{T}V_j,\qquad \min_{U,V}\sum_{i,j}\bigl(R_{ij}-U_i^{T}V_j\bigr)^2+\lambda\bigl(\lVert U_i\rVert^2+\lVert V_j\rVert^2\bigr)$$


RE METHODS: CLUSTER

First cluster the users/items on their profiles and ratings; then apply knn within one subgroup to produce the predictions.


RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, such as IF item A and B THEN item C, or IF item X THEN item Y.

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X, Y) = (# trxs with X and Y) / (total # trxs)

Lift = Support(X, Y) / (Support(X) · Support(Y))

Support & lift examples: Diapers → Beer 0.8; Diapers → Candles 0.018. For example, a lift of 2.5 means: if people have X, they are 2.5 times as likely to buy Y as people who don't have X.


METHOD ENSEMBLE

A linear combination of the previous methods, to achieve better performance.


PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD / LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      maxfeval = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      num = 3
      users = ("Longhow Lam");
run;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING (COMPARED TO TRADITIONAL LINEAR / LOGISTIC REGRESSION)

CONS:
• Unfamiliar to a broader audience; (more) difficult to explain.
• Black-box approach (you are rejected: the computer says NO).
• Often the relations can already be modeled with classical regression models.
• It allows you to not think about the business problem.

PROS:
• Often less data prep (manual tuning) necessary: just throw it in the algorithm…
• Interactions are often "automatically" taken into account.
• Superior for text mining, image & speech recognition.
• Better lift possible (a few percent "for free").
• It allows you to not think about the business problem.


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.


USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

(Figure: predicted review score versus given review score.)

R² linear regression = 0.5; R² neural net = 0.6.



IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

(Figure: the first 100 digits of the MNIST data with their KNOWN labels in red.)


MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split. Techniques tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.


MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

(Figure: the first 100 predicted digits together with the handwritten digits; the red numbers are the predicted labels. We obviously see some mistakes…)


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

(Audio players for the recorded digits 'one' and 'two'.)


SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
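A minimal sketch of the spectral step with PROC SPECTRA (SAS/ETS); the dataset wav1 with the sampled signal amplitude is an assumption:

proc spectra data=wav1 out=freq1 p s;
   var amplitude;
run;

/* FREQ1 holds the periodogram and spectral density by frequency;
   stack these per recording and apply principal components as before. */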


SPEECH RECOGNITION

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.


STRANGE FACE DETECTION: A COMBO OF AN OPEN API, R & SAS

A little joke on my colleagues…


STRANGE FACE DETECTION: A COMBO OF AN OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to: retrieve the SAS faces from our site; put them through the Face++ API; collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbour
• Strange faces? proc neural auto-encoder


STRANGE FACE DETECTION: LOOK-ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-ALIKES


STRANGE FACE DETECTION: STRANGE FACES

(Figure: SAS faces versus actors' faces.)

Read more on my blog.


  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 6: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SAS SOFTWAREFOR MACHINE LEARNING (AND DATA MINING)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IDENTIFY FORMULATE

PROBLEM

DATAPREPARATION

DATAEXPLORATION

TRANSFORMamp SELECT

BUILDMODEL

VALIDATEMODEL

DEPLOYMODEL

EVALUATE MONITORRESULTS

SAS In-Database ScoringSAS Decision Manager

BUSINESSMANAGER

SAS Model Manager

IT SYSTEMS MANAGEMENT

SAS Enterprise Guide

BUSINESSANALYST

Enterprise Miner Text MinerSAS IMSTAT Recommender

DATA MINER DATA SCIENTIST

THE ANALYTICS LIFECYCLE

SAS Visual AnalyticsSAS Visual Statistics

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES

PROC hpbnet data = creditdata structure = markovblanket model default = x1 LTV income age selction = YRUN

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING

Machine Learning algorithms designed to run on single blade or multi blade distributed memory environments

HIGH PERFORMANCE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Manage Rules + Data + Models

Deployment flexibility BatchReal TimeStored ProcessIn Database

Drive Reuse and Consistency

EASY DEPLOYABLE

Model

Data

Rules

Model

MACHINE LEARNING WITH SAS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICT SOMEONErsquoS INCOME

Income = 152 + 1102 times Age

Age

Income

Predict someones income from hisher age

Collect some data

Plot the data

Analytical Base Table

IS THIS MACHINE LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING ADDRESSING SOME MODELING ISSUES

The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip

You do not have one input variable X1 X2 X3helliphellipX567

Interactions en correlations between input variables

age

income

male

female

Analytical base table Derived inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects

Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 200 9000 9000

2 10000 150 5000 14000

3 10000 100 1000 15000

4 10000 100 1000 16000

5 10000 100 1000 17000

6 10000 100 1000 18000

7 10000 100 1000 19000

8 10000 080 -600 18400

9 10000 050 -3000 15400

10 10000 020 -5400 10000

The profit by using a model to sent letters only to the first 7 deciles is now

euro 19000 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 09 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 300 17000 17000

2 10000 200 9000 26000

3 10000 140 4200 30200

4 10000 115 2200 32400

5 10000 100 1000 33400

6 10000 060 -2200 31200

7 10000 040 -3800 27400

8 10000 030 -4600 22800

9 10000 010 -6200 16600

10 10000 005 -6600 10000

The profit by using a much better model to sent letters only to the first 5 deciles is now

euro 33400 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 234 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 335 19800 19800

2 10000 223 10840 30640

3 10000 130 3400 34040

4 10000 110 1800 35840

5 10000 100 1000 36840

6 10000 055 -2600 34240

7 10000 028 -4760 29480

8 10000 025 -5000 24480

9 10000 005 -6600 17880

10 10000 002 -6840 11040

Now lets suppose we have even a slightly better model than the last one

euro 36840

If you have 100 of such campaigns a year that means an increase of

euro 268 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines

K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

ldquoCLASSICALrdquo REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

(In the figure: similarity w, neighbors N.)
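The neighbors N and similarities w then enter a weighted average; a standard form of this k-NN prediction (not spelled out on the slide) is

\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N(u)} w_{u,v}\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in N(u)} |w_{u,v}|}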


RE METHODS: PEARSON CORRELATION

a, b : users
r_{a,p} : rating of user a for item p
P : set of items rated both by a and b
• Possible similarity values between -1 and 1

sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2}\;\sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}
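A minimal SAS/IML sketch of this similarity for two users' rating vectors (toy ratings, . = item not rated; here each user's mean is taken over the co-rated items P):

proc iml;
start pearson_sim(ra, rb);
   P  = loc(ra ^= . & rb ^= .);       /* items rated by both users */
   a  = ra[P];   b  = rb[P];
   da = a - a[:];  db = b - b[:];     /* center on each user's mean over P */
   return( sum(da # db) / sqrt(ssq(da) * ssq(db)) );
finish;

ra  = {5 3 . 4 2};
rb  = {4 3 2 5 .};
sim = pearson_sim(ra, rb);            /* 0.5 for these toy vectors */
print sim;
quit;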


RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data? Factorize the m x n user-item rating matrix R as R ≈ U V, with U (m x k, users) and V (k x n, items).

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: \hat{R}_{ij} = U_i^T V_j

Minimize the prediction error:

\min_{U,V} \sum_{i,j} \left( R_{ij} - U_i^T V_j \right)^2 + \lambda \left( \|U_i\|^2 + \|V_j\|^2 \right)
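A minimal SAS/IML sketch of this factorization, fitted here with plain stochastic gradient steps on the observed cells only (PROC RECOMMEND's svd method uses L-BFGS instead; the learning rate, λ and the toy matrix, taken from the user-item slide, are assumptions):

proc iml;
R = {3 2 5 4 5,
     . . . 1 1,
     1 . 2 5 .,
     . . 1 2 5};
m = nrow(R);  n = ncol(R);  k = 2;          /* k hidden factors */
call randseed(123);
U = j(m, k);  V = j(k, n);
call randgen(U, "Uniform");  call randgen(V, "Uniform");
lambda = 0.2;  gamma = 0.01;                /* penalty and learning rate */
do iter = 1 to 1000;
   do i = 1 to m;
      do j = 1 to n;
         if R[i, j] ^= . then do;
            e = R[i, j] - U[i, ] * V[, j];             /* prediction error */
            U[i, ] = U[i, ] + gamma * (e * V[, j]` - lambda * U[i, ]);
            V[, j] = V[, j] + gamma * (e * U[i, ]` - lambda * V[, j]);
         end;
      end;
   end;
end;
Rhat = U * V;                               /* filled-in rating matrix */
print Rhat;
quit;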


RE METHODS: CLUSTER

First cluster users/items on their profiles and ratings, then apply knn within one subgroup to generate the predictions.


RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining:
• Identify frequent itemsets (rules) in the transaction data:
  IF item A and B THEN item C
  IF item X THEN item Y
• Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule

Support(X,Y) = (# trxs with X and Y) / (total # trxs)

Lift(X => Y) = Support(X,Y) / ( Support(X) · Support(Y) )

Support & lift example: Diapers -> Beer 0.8; Diapers -> Candles 0.018.

For example, a lift of 2.5 means: people who have X are 2.5 times more likely to buy Y than people who don't have X.
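A quick numeric check of these two formulas (the transaction counts are made up):

proc iml;
total = 1000;                                 /* total number of transactions */
nXY = 100;  nX = 200;  nY = 250;              /* counts with {X,Y}, X, Y */
suppXY = nXY / total;                         /* support(X,Y) = 0.10 */
lift   = suppXY / ((nX/total) * (nY/total));  /* 0.10 / (0.20 * 0.25) = 2.0 */
print suppXY lift;
quit;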


METHOD: ENSEMBLE

Linear combination of the previous methods, to achieve better performance.


PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD LBFGS with 20 factors */
   METHOD svd
      factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100
      MAXFEVAL = 5000 function = L2 lamda = 0.2 technique = lbfgs;
   RUN;
   METHOD arm
      label = ARM;
   RUN;
   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT
      method = svd label = svd
      Num = 3
      users = ("Longhow Lam");
RUN;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO!)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")


WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces…

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.


USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs given review score:
R² linear regression = 0.5
R² neural net = 0.6

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data with their KNOWN labels in red.


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

Setup: a 70/30 training/validation split, with the following candidate models (a k-NN sketch follows the list):

• PCA regression on the 50 largest PCs
• Seven single layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors
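Such a k-NN fit can also be reproduced outside Enterprise Miner; a minimal sketch with PROC DISCRIM (the dataset and variable names mnist_train, mnist_valid, label and pixel1-pixel784 are assumptions):

proc discrim data = mnist_train test = mnist_valid
             testout = knn_pred method = npar k = 8;
   class label;                 /* the known digit 0-9 */
   var pixel1 - pixel784;       /* the 784 pixel inputs */
run;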


MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels; our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we see some obvious mistakes…


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[embedded audio players: the spoken digits 1 and 2]


SPEECH RECOGNITION

WAV files consist of ~ 30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
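A minimal sketch of this pipeline for one recording (the dataset wav1 with a numeric variable signal is an assumption; in practice the periodograms of all 16 recordings are first stacked, one row per recording):

/* frequency domain: periodogram of the sampled signal */
proc spectra data = wav1 out = freq1 p;
   var signal;
run;

/* after stacking the P_01 periodogram columns of all recordings into spectra_all: */
proc princomp data = spectra_all out = pcs n = 5;
   var f1 - f200;               /* hypothetical frequency-bin variables */
run;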


SPEECH RECOGNITION


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• retrieve the SAS faces from our site,
• put them through the Face++ API,
• collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural / auto-encoder


STRANGE FACE DETECTION: LOOK-ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION: STRANGE FACES

SAS faces | Actors' faces

Read more on my blog!



Page 7: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IDENTIFY FORMULATE

PROBLEM

DATAPREPARATION

DATAEXPLORATION

TRANSFORMamp SELECT

BUILDMODEL

VALIDATEMODEL

DEPLOYMODEL

EVALUATE MONITORRESULTS

SAS In-Database ScoringSAS Decision Manager

BUSINESSMANAGER

SAS Model Manager

IT SYSTEMS MANAGEMENT

SAS Enterprise Guide

BUSINESSANALYST

Enterprise Miner Text MinerSAS IMSTAT Recommender

DATA MINER DATA SCIENTIST

THE ANALYTICS LIFECYCLE

SAS Visual AnalyticsSAS Visual Statistics

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES

PROC hpbnet data = creditdata structure = markovblanket model default = x1 LTV income age selction = YRUN

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING

Machine Learning algorithms designed to run on single blade or multi blade distributed memory environments

HIGH PERFORMANCE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Manage Rules + Data + Models

Deployment flexibility BatchReal TimeStored ProcessIn Database

Drive Reuse and Consistency

EASY DEPLOYABLE

Model

Data

Rules

Model

MACHINE LEARNING WITH SAS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICT SOMEONErsquoS INCOME

Income = 152 + 1102 times Age

Age

Income

Predict someones income from hisher age

Collect some data

Plot the data

Analytical Base Table

IS THIS MACHINE LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING ADDRESSING SOME MODELING ISSUES

The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip

You do not have one input variable X1 X2 X3helliphellipX567

Interactions en correlations between input variables

age

income

male

female

Analytical base table Derived inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects

Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 200 9000 9000

2 10000 150 5000 14000

3 10000 100 1000 15000

4 10000 100 1000 16000

5 10000 100 1000 17000

6 10000 100 1000 18000

7 10000 100 1000 19000

8 10000 080 -600 18400

9 10000 050 -3000 15400

10 10000 020 -5400 10000

The profit by using a model to sent letters only to the first 7 deciles is now

euro 19000 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 09 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 300 17000 17000

2 10000 200 9000 26000

3 10000 140 4200 30200

4 10000 115 2200 32400

5 10000 100 1000 33400

6 10000 060 -2200 31200

7 10000 040 -3800 27400

8 10000 030 -4600 22800

9 10000 010 -6200 16600

10 10000 005 -6600 10000

The profit by using a much better model to sent letters only to the first 5 deciles is now

euro 33400 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 234 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 335 19800 19800

2 10000 223 10840 30640

3 10000 130 3400 34040

4 10000 110 1800 35840

5 10000 100 1000 36840

6 10000 055 -2600 34240

7 10000 028 -4760 29480

8 10000 025 -5000 24480

9 10000 005 -6600 17880

10 10000 002 -6840 11040

Now lets suppose we have even a slightly better model than the last one

euro 36840

If you have 100 of such campaigns a year that means an increase of

euro 268 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines

K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

ldquoCLASSICALrdquo REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 8: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES

PROC hpbnet data = creditdata structure = markovblanket model default = x1 LTV income age selction = YRUN

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING

Machine Learning algorithms designed to run on single blade or multi blade distributed memory environments

HIGH PERFORMANCE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Manage Rules + Data + Models

Deployment flexibility BatchReal TimeStored ProcessIn Database

Drive Reuse and Consistency

EASY DEPLOYABLE

Model

Data

Rules

Model

MACHINE LEARNING WITH SAS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICT SOMEONErsquoS INCOME

Income = 152 + 1102 times Age

Age

Income

Predict someones income from hisher age

Collect some data

Plot the data

Analytical Base Table

IS THIS MACHINE LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING ADDRESSING SOME MODELING ISSUES

The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip

You do not have one input variable X1 X2 X3helliphellipX567

Interactions en correlations between input variables

age

income

male

female

Analytical base table Derived inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects

Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 200 9000 9000

2 10000 150 5000 14000

3 10000 100 1000 15000

4 10000 100 1000 16000

5 10000 100 1000 17000

6 10000 100 1000 18000

7 10000 100 1000 19000

8 10000 080 -600 18400

9 10000 050 -3000 15400

10 10000 020 -5400 10000

The profit by using a model to sent letters only to the first 7 deciles is now

euro 19000 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 09 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 300 17000 17000

2 10000 200 9000 26000

3 10000 140 4200 30200

4 10000 115 2200 32400

5 10000 100 1000 33400

6 10000 060 -2200 31200

7 10000 040 -3800 27400

8 10000 030 -4600 22800

9 10000 010 -6200 16600

10 10000 005 -6600 10000

The profit by using a much better model to sent letters only to the first 5 deciles is now

euro 33400 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 234 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 335 19800 19800

2 10000 223 10840 30640

3 10000 130 3400 34040

4 10000 110 1800 35840

5 10000 100 1000 36840

6 10000 055 -2600 34240

7 10000 028 -4760 29480

8 10000 025 -5000 24480

9 10000 005 -6600 17880

10 10000 002 -6840 11040

Now lets suppose we have even a slightly better model than the last one

euro 36840

If you have 100 of such campaigns a year that means an increase of

euro 268 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines

K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

ldquoCLASSICALrdquo REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target  X1  X2   X3  X4  X5
Y       12  A   456  12  X
N       21  B   456  15  X
Y       32  A   545  13  U
Y       34  C   443  11  U
N       23  A   345  17  U
N       13  B   567  12  X
N       45  A   654  19  X
…        …  …     …   …   …
Y       46  A   657  21  X

A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split on
4. On the two new data sets, apply 1, 2, 3 again…
5. Stop somewhere…

• How to split? X1 or X2?
• When to stop?
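A minimal sketch of growing such a tree with PROC HPSPLIT (data set and variables are assumptions for illustration):

   proc hpsplit data=mydata maxdepth=5;
      class target x2 x5;              /* categorical target and inputs   */
      model target = x1 x2 x3 x4 x5;   /* recursive splitting             */
      prune costcomplexity;            /* prune the grown tree afterwards */
   run;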


DECISION TREES

How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.

Why is the split X1 < t1 better than X1 < s1? Compare a split criterion:
• Regression: mean squared error
• Classification: mis-classification rate, cross-entropy, chi-squared

[Plots: a regression tree fit, Y against x, comparing the mean square error of split s1 and split t1.]

DECISION TREES

How to split? (same criteria as above)

[Plots: a classification tree fit, comparing the mis-classification rate of split s1 and split t1.]

Decision trees (regression & classification)

When to stop? Not too early, not too late.

Pruning: remove parts of the tree.

DECISION TREES: SOME COMMON TYPES

• CHAID (chi-squared automatic interaction detection)
• C4.5 / C5.0
• CART (Classification and Regression Trees)

The difference is mainly in the different splitting options.

Decision trees: pros and cons

Pros:
• Interaction between variables
• Interpretable rules
• Missing values easy to incorporate

Cons:
• Unstable
• "Lack of smoothness"
• Fit of obvious (non)linear relations

[Examples: a response-rate tree splitting on gender, income < 45K and age < 33; and the Opel Astra price data.]

DIMENSION REDUCTION


PRINCIPAL COMPONENTS ANALYSIS

Linear transformation of the data to uncorrelated data.

The transformation W is such that:
• the largest variance is in the first coordinate
• the second largest variance is in the second coordinate
• etc…

PRINCIPAL COMPONENTS ANALYSIS

[Scatter plot: data points in the (X1, X2) plane, with the principal directions P1 and P2 drawn in.]

PRINCIPAL COMPONENTS ANALYSIS

[Plot: the same data expressed in the new coordinates P1 and P2.]

PRINCIPAL COMPONENTS ANALYSIS

The math behind: P = X·W

With two dimensions:

[ p11  p21 ]   [ x11  x21 ]
[  …    …  ] = [  …    …  ] · [ w11  w21 ]
[ p1n  p2n ]   [ x1n  x2n ]   [ w12  w22 ]

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.

In general, it turns out that the columns of W are the eigenvectors of the matrix XᵀX.

PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here.

Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the PCs instead of the original inputs
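A minimal sketch with PROC PRINCOMP (assumed data set and inputs). Note that PRINCOMP works on the correlation matrix by default, which takes care of the scaling:

   proc princomp data=mydata out=scores n=2;
      var x1-x100;   /* the scores data set gets the components Prin1, Prin2 */
   run;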


PRINCIPAL COMPONENTS: DIMENSION REDUCTION

P = X·W. Now take only the first L columns of W:  P_L = X·W_L

For example, for visualization only use the first 2 or 3 columns, so that P_L only has 2 or 3 columns that can be visualized in scatter or contour plots.

P   = X·W     (10000 × 100) = (10000 × 100) · (100 × 100)
P_L = X·W_L   (10000 × 2)   = (10000 × 100) · (100 × 2)

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition:  A = U·Σ·Vᵀ

Σ is diagonal with r singular values [r could be a large number].

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition:  A = U·Σ·Vᵀ, with Σ diagonal with r singular values [r could be a large number].

Take only k << r singular values:  A ≈ A_k = U_k·Σ_k·V_kᵀ

A data point d can now be represented by a k-dimensional point.
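In SAS/IML the decomposition and a rank-k reconstruction look as follows (a sketch; the matrix A is assumed to be in memory already):

   proc iml;
      call svd(U, Q, V, A);    /* A = U*diag(Q)*V`, Q holds the singular values */
      k  = 15;
      Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;   /* rank-k approximation */
   quit;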


SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original: 2448 × 3264 ≈ 8 mln numbers

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD with the 15 largest SV's: ≈ 1% of the data

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD with the 75 largest SV's: ≈ 5% of the data

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated. Then use only one input per cluster for predictive modeling.

X1, X2, X3, …, X500

• Cluster 1: X1, X21, X35, X430, … → use X35
• Cluster 2: X17, X29, X353, X490, … → use X29
• Cluster 3: X37, X95, X251, X393, … → use X251
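A minimal sketch with PROC VARCLUS (assumed data set); per cluster you would then keep the most representative variable, e.g. the one with the lowest 1−R² ratio:

   proc varclus data=mydata maxclusters=10 short;
      var x1-x500;
   run;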



BAGGING & BOOSTING


COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough: let multiple models vote for a prediction.

Bootstrap Aggregation (Bagging): draw random samples from the data, fit a model on each sample, and combine the models into a final model.

This only makes sense if the underlying models are different enough and have some predictive power.

[Diagram: data → random samples → models → final model.]

Bagging & Boosting: Random Forests

Random forests ≈ bagging with trees.

Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Choose randomly m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
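A sketch of these steps with PROC HPFOREST (data set, variables and option values are assumptions for illustration):

   proc hpforest data=mydata maxtrees=500 vars_to_try=10;
      target y / level=binary;          /* majority vote over the trees */
      input x1-x100 / level=interval;   /* candidate inputs             */
   run;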


FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data


FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear that the forest can produce much smoother predictions.

GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient Boosting: M iterations, m = 1, 2, …, M.

At each successive iteration a base learner hm (which is a decision tree) is fit on the pseudo-residuals, using the inputs x, to "correct" the previous learner. The pseudo-residuals rim are recomputed at each step.

Fm = Fm−1 + γ·hm

[Diagram: inputs x and pseudo-residuals r1, r2, …, rM feed successive base learners; the final model is FM.]
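A sketch with PROC TREEBOOST (assumed data set and variables); the shrinkage option plays the role of γ:

   proc treeboost data=mydata iterations=200 shrinkage=0.1;
      target y / level=binary;
      input x1-x10 / level=interval;
   run;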


SUPPORT VECTOR MACHINES


Support vector machines (SVM)

Suppose we have a separable classification problem. Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".



SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
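The formulas themselves were shown as images; for reference, the standard formulations from the literature (Hastie, Tibshirani & Friedman) are:

   Separable:       max M   subject to   y_i·(x_iᵀβ + β_0) ≥ M,  ‖β‖ = 1
   Non-separable:   min ½‖β‖² + C·Σ_i ξ_i   subject to   y_i·(x_iᵀβ + β_0) ≥ 1 − ξ_i,  ξ_i ≥ 0
   Lagrange dual:   max_α Σ_i α_i − ½·Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩   subject to   0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0
   Kernels:         replace ⟨x_i, x_j⟩ by K(x_i, x_j), e.g. the radial basis kernel K(x, x′) = exp(−γ‖x − x′‖²)

Note that the dual only involves inner products of the inputs, which is exactly why the kernel trick works.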


https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.

K-NEAREST NEIGHBOUR


K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

[Plot: the 5 nearest neighbours of x0; 3 of them are red, 2 of them are green, so we predict x0 to be red.]
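A minimal sketch of this majority-vote classification with PROC DISCRIM and method=npar (the data sets train and score and the variables are assumptions for illustration):

   proc discrim data=train test=score testout=scored
                method=npar k=5;   /* 5 nearest neighbours, majority vote */
      class y;                     /* e.g. red / green                    */
      var x1 x2;
   run;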


K-NN METHOD

[Decision boundaries: 1 nearest neighbour vs. 15 nearest neighbours.]


K-NN METHOD

Using different numbers k of nearest neighbours: test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns

K-NN EXAMPLE: DUTCH HOUSE PRICES

Extracted house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.

Comparing different nearest neighbours in SAS Enterprise Miner


K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were tried: k=5 nearest neighbours has the lowest average squared error.

NEURAL NETWORKS / DEEP LEARNING


NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4

[Diagram: a single neural network compute node; the inputs 1, X2, X3, X4 enter the node f with weights w1, w2, w3, w4.]

f is the so-called activation function. This could be the logit function, but other choices are possible.

There are four weights w's that have to be determined.


NEURAL NETWORKS: MATHEMATICAL FORMULATION

In formula form, the prediction formula for a NN is given by:

Z_m = σ(α_0m + α_mᵀ·X),  m = 1, …, M
Y = g(β_0 + βᵀ·Z)

[Diagram: inputs X1-X4 (Age, Income, Region, Gender) → hidden layer Z1, Z2, Z3 → output Y / N, with weights α on the first layer and β on the second.]

The functions g and σ are (typically) sigmoids, σ(v) = 1 / (1 + e^(−v)); in case of a binary classifier g is the logistic function.

The model weights α and β have to be estimated from the data.


NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back propagation algorithm:

Randomly choose small values for all weights wᵢ. Then, for each data point (observation):

1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w according to the gradient descent step wᵢ,new = wᵢ,old − γ·∂E/∂wᵢ
4. Stop if the error E is small enough
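A sketch of fitting such a network with PROC HPNEURAL (data set, variables and settings are assumptions for illustration):

   proc hpneural data=mydata;
      input x1-x4;             /* interval inputs               */
      target y / level=nom;    /* binary target                 */
      hidden 3;                /* one hidden layer with 3 nodes */
      train numtries=5 maxiter=1000;
   run;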


DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS


NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: inputs X1-X4 → ENCODE → two-dimensional middle layer → DECODE → outputs X1-X4.]

A linear activation function corresponds with two-dimensional principal components analysis; the two-dimensional middle layer can be used for visualisation.


NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often there are more hidden layers with many nodes.

[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT.]


NEURAL NET: CARS EXAMPLE

Two-dimensional PCA vs. an autoencoder network 25 - 15 - 2 - 15 - 25.


NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder


NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2 / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;


Two-dimensional representation of the 400-dimensional 'digit' data


BAYESIAN NETWORKS


BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data


TEXT MINING


TEXT MINING BASICS

"Advanced" word counting:

Parse & Filter:
• Part of speech
• Entity detection
• Mixed / numeric / abbreviations
• Stemming
• Spell checks
• Stop list
• Synonym list
• Multi-term words

Then apply traditional data mining:
• Clustering
• Prediction / machine learning


TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" ("I walk along the street in Amsterdam 1057DK with my bike")
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bit.ly.com/sdrtw" ("She did not walk but cycled with her blue bike", with a misspelling)
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" ("My two-wheeler is broken, what a bad piece of iron")

TERM DOCUMENT MATRIX A

Terms                          Doc 1   Doc 2   Doc 3
+Fiets (znmw)                    1       1       1
Fietsen (ww)                     0       1       0
Blauwe (bvg)                     0       1       0
Amsterdam (locatie)              1       0       0
+Lopen (ww)                      1       1       0
Straat (znmw)                    1       0       0
Kapot (bijw)                     0       0       1
Slecht                           0       0       1
Stuk Ijzer                       0       0       1
1057DK (postcode)                1       0       0
bit.ly.com/sdrtw (Internet)      0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A


TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:
• often more terms than documents
• rows could be strongly correlated
• the matrix is often very sparse

Apply a singular value decomposition first.


TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition:  A = U·Σ·Vᵀ, with Σ diagonal with r singular values [r could be many thousands].

Take only the first k << r singular values:  A ≈ A_k = U_k·Σ_k·V_kᵀ

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.


TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[Illustration: Topic 1, Topic 2, Topic 3.]


RECOMMENDATION ENGINE: Which product should I recommend to my customers?


RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: with 1 mln users and 100K items roughly 0.01% is filled.

User-Item Matrix - Data

          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      3       2       5       4       5
User 2      -       -       -       1       1
User 3      1       -       2       5       -
User 4      -       -       1       2       5
User 5      2       1       4       2       3
User 6      2       3       -       5       1
User 7      5       1       -       3       4
User 8      -       1       -       4       1
User 9      2       3       2       4       2
User 10     -       1       3       -       1

User 4's item ratings:          -     -    1  2  5
After some math… predicted:    3.21  4.82  1  2  5

Recommend item 2!


RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms:
• Slope one (slope1)
• K nearest neighbours (knn)

Model-based algorithms:
• Matrix factorization (SVD - LBFGS)

Market basket analysis:
• Association rules mining (arm)

Mixture of different methods:
• Clustering (cluster)
• Ensemble


RE METHODS: SLOPE ONE

Item-item based: y = x + b, with slope equal to 1.

Weight w_ij: the number of users having rated both items i and j.
Rating r̄_j: the average rating computed for item j.

Sample rating database:

Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        -       2       5
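Worked out on this table: the average difference A − B over the two users who rated both items is ((5−3) + (3−4)) / 2 = 0.5, and the difference A − C over the single user who rated both is (5−2) / 1 = 3. Lucy's predicted rating for item A is then the weighted average (2·(2 + 0.5) + 1·(5 + 3)) / (2 + 1) ≈ 4.33.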


RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighbourhood".

• How to determine the neighbours, and how many (k) to use?
• How to compute the similarity / distance measure?
  - Pearson's correlation coefficient
  - Cosine distance
  - Other adjustments

[Formula: r_ui as a similarity-weighted average over the neighbours N, with similarity weights w.]


RE METHODS: PEARSON CORRELATION

a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b.
• Possible similarity values between −1 and 1.

sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)·(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )


RE METHODS: K NEAREST NEIGHBORS METHOD


RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

Factorize the user-item matrix R (m × n, users as rows, items as columns) as R ≈ U·V, with U (m × k) and V (k × n):

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating:  R̂_ij = U_iᵀ·V_j

Minimize the prediction error:  min_{U,V} Σ_{i,j} (R_ij − U_iᵀV_j)² + λ·(‖U_i‖² + ‖V_j‖²)


RE METHODS: CLUSTER

First cluster the users/items on their profiles and ratings, then apply kNN within one subgroup to produce the predictions.

[Diagram: user/item profile and user/item ratings → clustering → kNN within one subgroup → predictions.]


RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X, Y) = (# trxs with X and Y) / (total # trxs)
Lift(X → Y) = Support(X, Y) / (Support(X) · Support(Y))

Support & lift examples: Diapers → Beer 0.8; Diapers → Candles 0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times as likely to buy Y as people who don't have X.
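A worked example with made-up numbers: out of 1,000 transactions, 100 contain diapers (support 0.10), 80 contain beer (support 0.08), and 20 contain both. Then Support(diapers, beer) = 20/1000 = 0.02 and Lift = 0.02 / (0.10 · 0.08) = 2.5.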


METHOD: ENSEMBLE

• Linear combination of the previous methods
• Achieves better performance


PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS / item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR / recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors   = 20
      label     = "svd"
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      maxfeval  = 5000
      function  = L2
      lambda    = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label  = "svd"
      num    = 3
      users  = ("Longhow Lam");
run;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING (compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience / (more) difficult to explain
• Black box approach (you are rejected: "the computer says NO")
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it into the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

• Text mining
• Image recognition
• Sound recognition
• Strange faces

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.


USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

[Scatter plot: predicted review score vs. given review score.]

R² linear regression = 0.5
R² neural net = 0.6

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS
MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

[Image: the first 100 digits of the MNIST data with their KNOWN labels in red.]


MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split. Models tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.


MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are predicted labels; we see some obvious mistakes…


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio players: recorded samples of the spoken digits '1' and '2'.]


SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

• Use spectral analysis to convert the signal to the frequency domain
• Still too much: apply principal components

TRAIN DATA:
• 8 spoken 'ones' in wav files
• 8 spoken 'twos' in wav files


SPEECH RECOGNITION


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data; zero errors on the test data (also 8 'ones' and 8 'twos').


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• retrieve the SAS faces from our site
• put them through the Face++ API
• collect the JSON results and store them in an ABT

Apply advanced analytics on the ABT:
• Which faces are look-alikes? → proc cluster (hierarchical clustering)
• Sales faces? → predictive modeling / machine learning
• Who is the Brad Pitt look-alike? → nearest neighbour
• Strange faces? → proc neural / autoencoder


STRANGE FACE DETECTION: LOOK-ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION: STRANGE FACES

SAS faces vs. actors' faces. Read more on my blog.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 9: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING

Machine Learning algorithms designed to run on single blade or multi blade distributed memory environments

HIGH PERFORMANCE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Manage Rules + Data + Models

Deployment flexibility BatchReal TimeStored ProcessIn Database

Drive Reuse and Consistency

EASY DEPLOYABLE

Model

Data

Rules

Model

MACHINE LEARNING WITH SAS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICT SOMEONErsquoS INCOME

Income = 152 + 1102 times Age

Age

Income

Predict someones income from hisher age

Collect some data

Plot the data

Analytical Base Table

IS THIS MACHINE LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING ADDRESSING SOME MODELING ISSUES

The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip

You do not have one input variable X1 X2 X3helliphellipX567

Interactions en correlations between input variables

age

income

male

female

Analytical base table Derived inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects

Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 200 9000 9000

2 10000 150 5000 14000

3 10000 100 1000 15000

4 10000 100 1000 16000

5 10000 100 1000 17000

6 10000 100 1000 18000

7 10000 100 1000 19000

8 10000 080 -600 18400

9 10000 050 -3000 15400

10 10000 020 -5400 10000

The profit by using a model to sent letters only to the first 7 deciles is now

euro 19000 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 09 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 300 17000 17000

2 10000 200 9000 26000

3 10000 140 4200 30200

4 10000 115 2200 32400

5 10000 100 1000 33400

6 10000 060 -2200 31200

7 10000 040 -3800 27400

8 10000 030 -4600 22800

9 10000 010 -6200 16600

10 10000 005 -6600 10000

The profit by using a much better model to sent letters only to the first 5 deciles is now

euro 33400 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 234 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 335 19800 19800

2 10000 223 10840 30640

3 10000 130 3400 34040

4 10000 110 1800 35840

5 10000 100 1000 36840

6 10000 055 -2600 34240

7 10000 028 -4760 29480

8 10000 025 -5000 24480

9 10000 005 -6600 17880

10 10000 002 -6840 11040

Now lets suppose we have even a slightly better model than the last one

euro 36840

If you have 100 of such campaigns a year that means an increase of

euro 268 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines

K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

ldquoCLASSICALrdquo REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: Slope One (slope1), K nearest neighbors (knn)

Model-based algorithms: Matrix factorization (SVD – LBFGS)

Market basket analysis: Association rules mining (arm)

Mixture of different methods: Clustering (cluster), Ensemble


RE METHODS SLOPE ONE

y = x + b, a regression line with slope equal to 1

Item-item based

Weight $w_{ij}$: the number of users having rated both items i and j. Rating $r_{uj}$: the average rating computed from item j.

Sample rating database:

Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4        ?
Lucy          ?        2        5
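
A minimal sketch of the (weighted) slope-one idea on this sample database: the average rating difference between two items, taken over the users who rated both, is added to a user's known ratings; the "?" cells above are simply absent.

# The sample rating database above; missing ratings are absent keys
ratings = {
    "John": {"A": 5, "B": 3, "C": 2},
    "Mark": {"A": 3, "B": 4},
    "Lucy": {"B": 2, "C": 5},
}

def slope_one(user, target):
    num, den = 0.0, 0
    for j, r_uj in ratings[user].items():
        # rating differences (target - j) over users who rated both items
        diffs = [r[target] - r[j] for r in ratings.values()
                 if target in r and j in r]
        if diffs:
            num += (r_uj + sum(diffs) / len(diffs)) * len(diffs)
            den += len(diffs)
    return num / den

print(round(slope_one("Lucy", "A"), 2))   # 4.33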


RE METHODS K NEAREST NEIGHBORS

The rating $r_{ui}$ is determined by the ratings "in the neighborhood"

How to determine the neighbors, and how many (k) to use?

How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

Similarity w

Neighbors N


RE METHODS

PEARSON CORRELATION

$a, b$ : users
$r_{a,p}$ : rating of user $a$ for item $p$
$P$ : set of items rated both by $a$ and $b$
• Possible similarity values between $-1$ and $1$

$$\mathrm{sim}(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2}\,\sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}$$
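
A minimal sketch of this Pearson similarity between two users' ratings, restricted to the items they both rated; the example ratings are made up.

import numpy as np

def pearson_sim(ra, rb):
    # ra, rb: dicts item -> rating for users a and b
    common = [p for p in ra if p in rb]      # items rated by both
    if len(common) < 2:
        return 0.0
    x = np.array([ra[p] for p in common], dtype=float)
    y = np.array([rb[p] for p in common], dtype=float)
    x, y = x - x.mean(), y - y.mean()
    denom = np.sqrt((x * x).sum() * (y * y).sum())
    return float((x * y).sum() / denom) if denom else 0.0

print(pearson_sim({"A": 5, "B": 3, "C": 2}, {"A": 4, "B": 2, "C": 1}))   # 1.0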


RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

Factorize the m × n rating matrix (users in the rows, items in the columns): $R \approx U V$, with $U$ an $m \times k$ matrix of user factors and $V$ a $k \times n$ matrix of item factors.

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: $\hat{R}_{ij} = U_i^T V_j$

Minimize the prediction error:
$$\min_{U,V} \sum_{i,j} \left( R_{ij} - U_i^T V_j \right)^2 + \lambda \left( \|U_i\|^2 + \|V_j\|^2 \right)$$
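
A minimal numpy sketch of this factorization; it uses plain gradient descent on the observed cells instead of L-BFGS or ALS, with toy ratings taken from the user-item matrix above (0 marks a missing rating).

import numpy as np

R = np.array([[3, 2, 5, 4, 5],
              [0, 0, 0, 1, 1],
              [1, 0, 2, 5, 0]], dtype=float)
mask = R > 0                              # observed cells only
m, n = R.shape
k, lam, lr = 2, 0.1, 0.01                 # hidden factors, penalty, step size

rng = np.random.default_rng(1)
U, V = rng.normal(size=(m, k)), rng.normal(size=(k, n))

for _ in range(2000):
    E = mask * (R - U @ V)                # error on observed ratings
    U += lr * (E @ V.T - lam * U)
    V += lr * (U.T @ E - lam * V)

print(np.round(U @ V, 2))                 # predictions, incl. missing cells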


RE METHODS CLUSTER

Cluster the users/items first, based on the user/item profiles and user/item ratings; then apply knn within one subgroup to make the predictions.


RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule

Support(X,Y) = (# transactions with X and Y) / (total # transactions)

Lift(X,Y) = Support(X,Y) / ( Support(X) · Support(Y) )

Support & Lift: Diapers → Beer 0.8, Diapers → Candles 0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
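
A minimal sketch of computing support and lift from a list of transactions; the baskets are made up.

# Toy transaction data: each set is one basket
transactions = [
    {"diapers", "beer"}, {"diapers", "beer", "milk"},
    {"diapers", "candles"}, {"beer", "milk"}, {"diapers", "beer"},
]

def support(*items):
    hits = sum(1 for t in transactions if set(items) <= t)
    return hits / len(transactions)

def lift(x, y):
    return support(x, y) / (support(x) * support(y))

print(support("diapers", "beer"))   # 0.6
print(lift("diapers", "beer"))      # 0.9375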


METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance


PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20 label = svd fconv = 1e-3 gconv = 1e-3
      maxiter = 100 MAXFEVAL = 5000 function = L2
      lamda = 0.2 technique = lbfgs;
   RUN;

   METHOD ARM / label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;

   PREDICT /
      method = svd
      label = svd
      Num = 3
      users = ("Longhow Lam");

run;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance & scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining | Image recognition | Sound recognition | Strange faces

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews, and to transform reviews to data points in SVD space


Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R² linear regression = 0.5, R² neural net = 0.6



IENS REVIEWS APPLY MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split

• PCA regression on the 50 largest PCs
• Seven single layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors
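
For comparison, a minimal sketch of the same experiment shape in Python/scikit-learn (instead of Enterprise Miner): a 70/30 split and a k-nearest-neighbour classifier; scikit-learn's small 8×8 digits set stands in for the MNIST data.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)      # small stand-in for MNIST
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
print("validation accuracy:", knn.score(X_val, y_val))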


MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels. We see some obvious mistakes…


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[audio players: a spoken 'one' and a spoken 'two']


SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy

Use spectral analysis to convert the signal to the frequency domain

Still too much: apply principal components

TRAIN DATA

8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files
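
A minimal numpy sketch of this pipeline (spectral analysis via an FFT, then principal components on the spectra); the signal sizes are made up, and the workshop itself used SAS procedures for these steps.

import numpy as np

# Hypothetical batch of recordings: 16 clips x 30000 samples each
signals = np.random.randn(16, 30000)

# Spectral analysis: magnitude spectrum per clip (frequency domain)
spectra = np.abs(np.fft.rfft(signals, axis=1))

# PCA via SVD on the centered spectra; keep a handful of components
centered = spectra - spectra.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
features = centered @ Vt[:5].T            # 5 PC scores per clip
print(features.shape)                     # (16, 5) -> inputs for a neural net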


SPEECH RECOGNITION


SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data (also 8 'ones' and 8 'twos')

In Enterprise Miner: a neural network with 9 neurons in one hidden layer


STRANGE FACE DETECTION COMBO OF OPEN API, R & SAS

A little joke on my colleagues…


STRANGE FACE DETECTION COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Apply advanced analytics on the ABT:
• Which faces are look-alikes? → proc cluster (hierarchical clustering)
• Sales faces? → Predictive modeling / machine learning
• Who is the Brad Pitt? → Nearest Neighbor
• Strange faces? → proc neural auto-encoder

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.


STRANGE FACE DETECTION LOOK ALIKE FACES


STRANGE FACE DETECTION BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog


  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 10: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Manage Rules + Data + Models

Deployment flexibility BatchReal TimeStored ProcessIn Database

Drive Reuse and Consistency

EASY DEPLOYABLE

Model

Data

Rules

Model

MACHINE LEARNING WITH SAS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICT SOMEONErsquoS INCOME

Income = 152 + 1102 times Age

Age

Income

Predict someones income from hisher age

Collect some data

Plot the data

Analytical Base Table

IS THIS MACHINE LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING ADDRESSING SOME MODELING ISSUES

The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip

You do not have one input variable X1 X2 X3helliphellipX567

Interactions en correlations between input variables

age

income

male

female

Analytical base table Derived inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects

Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 200 9000 9000

2 10000 150 5000 14000

3 10000 100 1000 15000

4 10000 100 1000 16000

5 10000 100 1000 17000

6 10000 100 1000 18000

7 10000 100 1000 19000

8 10000 080 -600 18400

9 10000 050 -3000 15400

10 10000 020 -5400 10000

The profit by using a model to sent letters only to the first 7 deciles is now

euro 19000 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 09 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 300 17000 17000

2 10000 200 9000 26000

3 10000 140 4200 30200

4 10000 115 2200 32400

5 10000 100 1000 33400

6 10000 060 -2200 31200

7 10000 040 -3800 27400

8 10000 030 -4600 22800

9 10000 010 -6200 16600

10 10000 005 -6600 10000

The profit by using a much better model to sent letters only to the first 5 deciles is now

euro 33400 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 234 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 335 19800 19800

2 10000 223 10840 30640

3 10000 130 3400 34040

4 10000 110 1800 35840

5 10000 100 1000 36840

6 10000 055 -2600 34240

7 10000 028 -4760 29480

8 10000 025 -5000 24480

9 10000 005 -6600 17880

10 10000 002 -6840 11040

Now lets suppose we have even a slightly better model than the last one

euro 36840

If you have 100 of such campaigns a year that means an increase of

euro 268 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines

K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

ldquoCLASSICALrdquo REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 11: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICT SOMEONErsquoS INCOME

Income = 152 + 1102 times Age

Age

Income

Predict someones income from hisher age

Collect some data

Plot the data

Analytical Base Table

IS THIS MACHINE LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING ADDRESSING SOME MODELING ISSUES

The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip

You do not have one input variable X1 X2 X3helliphellipX567

Interactions en correlations between input variables

age

income

male

female

Analytical base table Derived inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects

Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 200 9000 9000

2 10000 150 5000 14000

3 10000 100 1000 15000

4 10000 100 1000 16000

5 10000 100 1000 17000

6 10000 100 1000 18000

7 10000 100 1000 19000

8 10000 080 -600 18400

9 10000 050 -3000 15400

10 10000 020 -5400 10000

The profit by using a model to sent letters only to the first 7 deciles is now

euro 19000 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 09 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 300 17000 17000

2 10000 200 9000 26000

3 10000 140 4200 30200

4 10000 115 2200 32400

5 10000 100 1000 33400

6 10000 060 -2200 31200

7 10000 040 -3800 27400

8 10000 030 -4600 22800

9 10000 010 -6200 16600

10 10000 005 -6600 10000

The profit by using a much better model to sent letters only to the first 5 deciles is now

euro 33400 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 234 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 335 19800 19800

2 10000 223 10840 30640

3 10000 130 3400 34040

4 10000 110 1800 35840

5 10000 100 1000 36840

6 10000 055 -2600 34240

7 10000 028 -4760 29480

8 10000 025 -5000 24480

9 10000 005 -6600 17880

10 10000 002 -6840 11040

Now lets suppose we have even a slightly better model than the last one

euro 36840

If you have 100 of such campaigns a year that means an increase of

euro 268 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines

K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

ldquoCLASSICALrdquo REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION


Decision trees (regression & classification)

When to stop? Not too early, not too late.

Pruning: remove parts of the tree.


DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C4.5, C5.0

CART (Classification And Regression Trees)

The difference is mainly in the different splitting options.


Decision trees: pros and cons

Pros: interactions between variables; interpretable rules; missing values easy to incorporate.

Cons: unstable; "lack of smoothness"; fit of obvious (non)linear relations.

[Example tree: male / female, income < 45K, age < 33, response rate per leaf]

[Figure: the Opel Astra spline fit, a smooth relation a tree fits poorly]


DIMENSION REDUCTION


PRINCIPAL COMPONENTS ANALYSIS

Linear transformation of the data to uncorrelated data.

The transformation W is such that the largest variance is in the first coordinate, the second largest variance is in the second coordinate, etc.


PRINCIPAL COMPONENTS ANALYSIS

[Figure: scatter of data points in the (X1, X2) plane with the principal directions P1 and P2]


PRINCIPAL COMPONENTS ANALYSIS

[Figure: the same data rotated onto the principal components P1 and P2]


PRINCIPAL COMPONENTS ANALYSIS

The math behind: P = XW.

With two dimensions:

[ p11  p21 ]   [ x11  x21 ]
[  …    …  ] = [  …    …  ]  [ w11  w21 ]
[ p1n  p2n ]   [ x1n  x2n ]  [ w12  w22 ]

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.

In general, it turns out that the columns of W are the eigenvectors of the matrix XTX.


PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here.

Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the principal components instead of the original inputs
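A minimal sketch in SAS (PROC PRINCOMP is the standard PCA procedure; the data set work.abt and the inputs x1-x100 are hypothetical placeholders):

proc princomp data=work.abt out=work.scores std n=2;
   var x1-x100;   /* the scores Prin1, Prin2 are written to work.scores */
run;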


PRINCIPAL COMPONENTS: DIMENSION REDUCTION

P = XW. Now take only the first L columns of W:

PL = X WL

For example, for visualization only use the first 2 or 3 columns, so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots.

P  = X W      (10,000 by 100) = (10,000 by 100)(100 by 100)
PL = X WL     (10,000 by 2)   = (10,000 by 100)(100 by 2)


SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition:

A = U Σ VT

Σ is diagonal with r singular values [r could be a large number].


SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ VT, with Σ diagonal with r singular values [r could be a large number].

Take only k << r singular values:

Ak = Uk Σk VkT

A data point d can now be represented by a k-dimensional point.
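A small sketch in SAS/IML (CALL SVD is the built-in decomposition; the table name work.pixels is a hypothetical placeholder):

proc iml;
   use work.pixels;  read all var _num_ into A;  close;
   call svd(U, S, V, A);                        /* A = U*diag(S)*V`      */
   k = 15;                                      /* keep k largest SVs    */
   Ak = U[, 1:k] * diag(S[1:k]) * V[, 1:k]`;    /* rank-k approximation  */
quit;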


SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original: 2448 x 3264 ≈ 8 mln numbers.


SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD, 15 largest SVs: 1% of the data.


SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD, 75 largest SVs: 5% of the data.


VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling.

X1, X2, X3, …, X500

Cluster 1: X1, X21, X35, X430, …  → use X35
Cluster 2: X17, X29, X353, X490, … → use X29
Cluster 3: X37, X95, X251, X393, … → use X251
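A hedged sketch (PROC VARCLUS is the SAS variable-clustering procedure; a representative per cluster can then be picked from its 1-R**2 ratio output):

proc varclus data=work.abt maxclusters=10 short;
   var x1-x500;   /* cluster the 500 inputs into at most 10 clusters */
run;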


VARIABLE CLUSTERING TO REDUCE THE DIMENSION


VARIABLE CLUSTERING TO REDUCE THE DIMENSION


BAGGING & BOOSTING


COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough, let multiple models vote for a prediction: bootstrap aggregation (bagging).

This only makes sense if the underlying models are different enough and have some predictive power.

[Diagram: data → random samples → models → vote → final model]
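A sketch of the bootstrap step (PROC SURVEYSELECT with unrestricted random sampling generates the samples; a model is then fit per Replicate and the predictions vote):

proc surveyselect data=work.abt out=work.boot seed=123
                  method=urs samprate=1 reps=100 outhits;  /* 100 bootstrap samples */
run;
/* fit a model BY Replicate on work.boot, then let the models vote */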


Bagging & Boosting: Random Forests

Random forests ≈ bagging with trees.

Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Choose randomly m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.

In case of a regression tree: the random forest prediction is the average of all trees.
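A hedged sketch (PROC HPFOREST is the Enterprise Miner high-performance forest procedure; the data set and variables here are hypothetical):

proc hpforest data=work.abt maxtrees=100 vars_to_try=10;
   target y / level=binary;          /* majority vote over the trees */
   input x1-x100 / level=interval;   /* m = 10 inputs tried per split */
run;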


FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data


FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear that the forest can produce much smoother predictions.


GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU


GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient boosting: M iterations, m = 1, 2, …, M.

At each successive iteration a base learner hm (which is a decision tree) is fit on the pseudo-residuals rm, using the inputs x, to "correct" the previous learner:

Fm = Fm-1 + γ·hm

The pseudo-residuals rim are recomputed at each step; after M iterations the final model is FM.
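A hedged sketch (PROC TREEBOOST is the Enterprise Miner gradient-boosting procedure; ITERATIONS plays the role of M and SHRINKAGE the role of the step size γ; the data set and variables are hypothetical):

proc treeboost data=work.abt iterations=100 shrinkage=0.1;
   target y / level=binary;
   input x1-x100 / level=interval;
run;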


SUPPORT VECTOR MACHINES


Support vector machines (SVM): suppose we have a separable classification problem.

Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x2, x3 or spline(x).

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".


SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

Separable classification

Non-separable classification

Non-separable classification rewritten using the Lagrange dual problem

Kernels to model non-linear behaviour
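A hedged sketch (PROC HPSVM is the SAS high-performance SVM procedure; the penalty C and the kernel choice mirror the story above, but the exact option spelling should be checked against the documentation and the data set is hypothetical):

proc hpsvm data=work.abt c=1.0;
   input x1 x2 / level=interval;
   target y;
   kernel rbf;   /* non-linear decision boundary via the kernel trick */
run;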


https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.


K-NEAREST NEIGHBOUR


K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

[Figure: the 5 nearest neighbours of x0: 3 of them are red, 2 of them are green, so we predict x0 to be red]
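A hedged sketch of k-NN scoring in SAS (PROC DISCRIM with METHOD=NPAR and K= performs nearest-neighbour classification; the data set and variable names are hypothetical):

proc discrim data=work.train test=work.query testout=work.pred
             method=npar k=5;
   class colour;   /* red / green */
   var x1 x2;
run;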


K-NN METHOD

[Figure: decision boundaries for 1 nearest neighbour and 15 nearest neighbours]


K-NN METHOD

Use different numbers k of nearest neighbours: test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns


K-NN EXAMPLE DUTCH HOUSE PRICES

Extracted house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.


Comparing different nearest neighbours in SAS Enterprise Miner


K-NN EXAMPLE DUTCH HOUSE PRICES

30% of the data was used as a validation set. In Enterprise Miner different values for k were used; k = 5 nearest neighbours has the lowest average squared error.


NEURAL NETWORKS / DEEP LEARNING


NEURAL NETWORK LINEAR REGRESSION

Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4

[Diagram: a single neural network compute node with inputs 1, X2, X3, X4 and weights w1, w2, w3, w4]

f is the so-called activation function. This could be the logit function, but other choices are possible.

There are four weights w that have to be determined.


NEURAL NETWORKS MATHEMATICAL FORMULATION

In formulas, the prediction formula for a NN is given by:

Zj = σ(α0j + αjT X)      (hidden layer)
Y  = g(β0 + βT Z)        (output)

[Diagram: inputs X1 (age), X2 (income), X3 (region), X4 (gender); hidden layer Z1, Z2, Z3; output Y]

The function σ is the activation function; in case of a binary classifier, g is the logistic function.

The model weights α and β have to be estimated from the data.


NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back-propagation algorithm:

Randomly choose small values for all weights wi. Then, for each data point (observation):

1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual - prediction)2)
3. Adjust the weights w according to the gradient of the error: wnew = wold - η·∂E/∂w
4. Stop if the error E is small enough
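A minimal numeric sketch of this update rule for a single logistic neuron (plain DATA step; the data set work.train with input x and 0/1 target y is hypothetical, eta is the learning rate):

data _null_;
   retain w0 0.01 w1 0.01 eta 0.5;
   set work.train end=last;
   p  = 1 / (1 + exp(-(w0 + w1*x)));   /* neural net prediction    */
   e  = y - p;                         /* prediction error         */
   g  = e * p * (1 - p);               /* dE/dw via the chain rule */
   w0 = w0 + eta * g;                  /* adjust the weights       */
   w1 = w1 + eta * g * x;
   if last then put 'estimated weights: ' w0= w1=;
run;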


DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS


NEURAL NETS AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: X1, X2, X3, X4 → ENCODE → DECODE → X1, X2, X3, X4]

A linear activation function corresponds with 2-dimensional principal components analysis.

A 2-dimensional middle layer can be used for visualisation.


NEURAL NETS AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes.

[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]


NEURAL NET CARS EXAMPLE

2-dimensional PCA versus an autoencoder network 25 – 15 – 2 – 15 – 25.


NEURAL NETS AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder


NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data = autoencoderTraining
   dmdbcat = work.autoencoderTrainingCat;
   performance compile details cpucount = 12 threads = yes;
   /* DEFAULTS: ACT = TANH, COMBINE = LINEAR          */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED       */
   archi MLP hidden = 5;
   hidden 300 / id = h1;
   hidden 100 / id = h2;
   hidden 2   / id = h3 act = linear;
   hidden 100 / id = h4;
   hidden 300 / id = h5;
   input corruptedPixel1 - corruptedPixel400 / id = i level = int std = std;
   target pixel1 - pixel400 / act = identity id = t level = int std = std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random = 123;
   prelim 10 preiter = 10;
run;


Two-dimensional representation of the 400-dimensional 'digit' data.


BAYESIAN NETWORKS


BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data


TEXT MINING


TEXT MINING BASICS

"Advanced" word counting:

Parse & filter: part of speech, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Then apply traditional data mining: clustering, prediction, machine learning.


TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

TERM DOCUMENT MATRIX A

Terms                    Doc 1  Doc 2  Doc 3
+Fiets (znmw)              1      1      1
Fietsen (ww)               0      1      0
Blauwe (bvg)               0      1      0
Amsterdam (locatie)        1      0      0
+Lopen (ww)                1      1      0
Straat (znmw)              1      0      0
Kapot (bijw)               0      0      1
Slecht                     0      0      1
Stuk Ijzer                 0      0      1
1057DK (postcode)          1      0      0
bitlycomsdrtw (Internet)   0      1      0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A


TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:

• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply a singular value decomposition first.


TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ VT, with Σ diagonal with r singular values [r could be many thousands].

Take only the first k << r singular values:

Ak = Uk Σk VkT

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.


TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

Topic 1, Topic 2, Topic 3


RECOMMENDATION ENGINE: Which product should I recommend to my customers?


RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items, ~ 0.01% of the cells filled.

User-Item Matrix - Data
          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      3       2       5       4       5
User 2      -       -       -       1       1
User 3      1       -       2       5       -
User 4      -       -       1       2       5
User 5      2       1       4       2       3
User 6      2       3       -       5       1
User 7      5       1       -       3       4
User 8      -       1       -       4       1
User 9      2       3       2       4       2
User 10     -       1       3       -       1

User 4's item ratings:            -     -     1     2     5
After some math… predicted:     3.21  4.82    1     2     5

Recommend item 2!


RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1), k nearest neighbours (knn)

Model-based algorithms: matrix factorization (SVD - LBFGS)

Market basket analysis: association rules mining (arm)

Mixture of different methods: clustering (cluster), ensemble


RE METHODS SLOPE ONE

Slope one is item-item based, with predictors of the form y = x + b, i.e. with slope equal to 1.

Weight wij: the number of users having rated both items i and j. Rating ruj: the rating of user u for item j.

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         2        -        5
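As a worked example on the sample table (our arithmetic, following the slope-one scheme): the average deviation of item B relative to A is ((3 - 5) + (4 - 3)) / 2 = -0.5, and relative to C it is (3 - 2) / 1 = 1. Lucy's predicted rating for item B is then the weighted average ((2 - 0.5)·2 + (5 + 1)·1) / (2 + 1) = 3.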


RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings "in the neighborhood" N, weighted by the similarities w.

How to determine the neighbours, and how many (k) to use?

How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments


RE METHODS

PEARSON CORRELATION

a, b: users
ra,p: rating of user a for item p
P: set of items rated both by a and b
• Possible similarity values between -1 and 1

sim(a, b) = Σp∈P (ra,p - r̄a)(rb,p - r̄b) / ( √(Σp∈P (ra,p - r̄a)²) · √(Σp∈P (rb,p - r̄b)²) )
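A small sketch of this formula in SAS/IML (the two rating vectors over the co-rated items P hold hypothetical numbers):

proc iml;
   ra = {5, 3, 2, 4};                        /* ratings of user a on P */
   rb = {4, 3, 1, 5};                        /* ratings of user b on P */
   da = ra - mean(ra);  db = rb - mean(rb);  /* center the ratings     */
   sim = (da` * db) / (sqrt(ssq(da)) * sqrt(ssq(db)));
   print sim;                                /* between -1 and 1       */
quit;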


RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

R ≈ U VT: the (m × n) ratings matrix (users × items) is factored into U (m × k) and V (n × k), with k hidden factors.

Select a loss function (squared error), select the number of hidden factors k, and solve the optimization problem (L-BFGS, ALS).

Predict a new rating: R̂ij = UiT Vj

Minimize the prediction error:

min U,V  Σi,j ( Rij - UiT Vj )² + λ( ‖Ui‖² + ‖Vj‖² )
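A minimal gradient-descent sketch of this factorization in SAS/IML (toy ratings matrix; eta, lambda and k are illustrative values, not tuned):

proc iml;
   call randseed(123);
   R = {5 3 . 1,
        4 . . 1,
        1 1 . 5,
        1 . . 4};                          /* . = missing rating      */
   k = 2;  eta = 0.01;  lambda = 0.2;      /* factors, step, penalty  */
   U = j(nrow(R), k);  call randgen(U, "Uniform");
   V = j(ncol(R), k);  call randgen(V, "Uniform");
   do iter = 1 to 500;
      do i = 1 to nrow(R);
         do p = 1 to ncol(R);
            if R[i,p] ^= . then do;        /* observed ratings only   */
               e = R[i,p] - U[i,] * V[p,]`;
               U[i,] = U[i,] + eta * (e * V[p,] - lambda * U[i,]);
               V[p,] = V[p,] + eta * (e * U[i,] - lambda * V[p,]);
            end;
         end;
      end;
   end;
   Rhat = U * V`;                          /* filled-in rating matrix */
   print Rhat;
quit;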


RE METHODS CLUSTER

First cluster the user/item profiles; then apply knn within one subgroup to predict the ratings.

[Diagram: user/item profile → clustering → user/item rating → predictions]


RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X, Y) = # trxs {X, Y} / # total trxs

Lift = Support(X, Y) / ( Support(X) · Support(Y) )

Example support & lift:  {Diapers, Beer}: 0.8   {Diapers, Candles}: 0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
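A tiny numeric illustration of these two formulas (the transaction counts are hypothetical):

data _null_;
   total = 100000;                       /* total transactions             */
   nX = 4000;  nY = 8000;  nXY = 800;    /* trxs with X, with Y, with both */
   support_xy = nXY / total;
   lift = support_xy / ((nX/total) * (nY/total));
   put support_xy= lift=;                /* lift = 2.5 here                */
run;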


METHOD ENSEMBLE

A linear combination of the previous methods, to achieve better performance.


PROC RECOMMEND recom = rsIENS;

   /* add a recommendation system */
   ADD rsIENS item = item user = user rating = rating;

   /* add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rsIENS type = rating vars = (item user rating);

   /* method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20 label = svd fconv = 1e-3 gconv = 1e-3
      maxiter = 100 maxfeval = 5000 function = L2
      lamda = 0.2 technique = lbfgs;
   RUN;

   METHOD arm / label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rsIENS;
   PREDICT /
      method = svd
      label = svd
      Num = 3
      users = (Longhow Lam);
run;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
Unfamiliar to a broader audience, (more) difficult to explain
Black-box approach (you are rejected: the computer says NO!)
Often relations can already be modeled with classical regression models
It allows you to not think about the business problem

PROS:
Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
Interactions often "automatically" taken into account
Superior for text mining, image & speech recognition
Better lift possible (a few percent "for free")
It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.


Predicted review score vs. given review score.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

R2 linear regression = 0.5; R2 neural net = 0.6.


IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split. Models tried:

• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours


MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels; our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here.

Red numbers are the predicted labels. We see some obvious mistakes…


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio clips: spoken digits "1" and "2"]


SPEECH RECOGNITION

WAV files consist of ~ 30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain.

Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.


SPEECH RECOGNITION


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data; zero errors on the test data (also 8 'ones' and 8 'twos').


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbour
• Strange faces? proc neural / autoencoder


STRANGE FACE DETECTION: LOOK-ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 12: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING ADDRESSING SOME MODELING ISSUES

The problem may not be linear X2 X3 Log(X) Sqrt(X) 1X helliphellip

You do not have one input variable X1 X2 X3helliphellipX567

Interactions en correlations between input variables

age

income

male

female

Analytical base table Derived inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Suppose we have an untargeted direct mailing of 100000 lsquolettersrsquo to randomly sampled prospects

Conversion rate is around 1 Profit per conversion euro80 Cost per mailing is euro070 Total ROI = 100000 X 1 X euro 80 100000 X euro 070 = euro 10000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data that can distinguish between high low responders

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 200 9000 9000

2 10000 150 5000 14000

3 10000 100 1000 15000

4 10000 100 1000 16000

5 10000 100 1000 17000

6 10000 100 1000 18000

7 10000 100 1000 19000

8 10000 080 -600 18400

9 10000 050 -3000 15400

10 10000 020 -5400 10000

The profit by using a model to sent letters only to the first 7 deciles is now

euro 19000 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 09 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 300 17000 17000

2 10000 200 9000 26000

3 10000 140 4200 30200

4 10000 115 2200 32400

5 10000 100 1000 33400

6 10000 060 -2200 31200

7 10000 040 -3800 27400

8 10000 030 -4600 22800

9 10000 010 -6200 16600

10 10000 005 -6600 10000

The profit by using a much better model to sent letters only to the first 5 deciles is now

euro 33400 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 234 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 335 19800 19800

2 10000 223 10840 30640

3 10000 130 3400 34040

4 10000 110 1800 35840

5 10000 100 1000 36840

6 10000 055 -2600 34240

7 10000 028 -4760 29480

8 10000 025 -5000 24480

9 10000 005 -6600 17880

10 10000 002 -6840 11040

Now lets suppose we have even a slightly better model than the last one

euro 36840

If you have 100 of such campaigns a year that means an increase of

euro 268 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines

K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

ldquoCLASSICALrdquo REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

Factorize the m × n user-item matrix R into two low-rank matrices: $R \approx U\,V$, with $U$ of size $m \times k$ (users) and $V$ of size $k \times n$ (items).

- Select a loss function (squared error)
- Select the number of hidden factors k
- Optimization problem: L-BFGS, ALS

Predict a new rating: $\hat R_{ij} = U_i^T V_j$

Minimize the prediction error:
$$
\min_{U,V} \sum_{i,j} \left( R_{ij} - U_i^T V_j \right)^2 + \lambda \left( \|U_i\|^2 + \|V_j\|^2 \right)
$$
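A minimal stochastic-gradient sketch of this factorization (illustrative only: PROC RECOMMEND's actual solvers are L-BFGS and ALS, and the tiny rating matrix, step size and lambda here are made up; W marks which ratings are observed):

proc iml;
R = {5 4 0,
     0 2 3,
     4 0 1};
W = {1 1 0,
     0 1 1,
     1 0 1};                              /* 1 = observed rating            */
k = 2;  lambda = 0.1;  step = 0.05;
U = j(nrow(R), k);  V = j(k, ncol(R));
call randseed(123);
call randgen(U, "Uniform");  call randgen(V, "Uniform");
do iter = 1 to 500;
   do i = 1 to nrow(R);
      do j = 1 to ncol(R);
         if W[i,j] = 1 then do;
            e  = R[i,j] - U[i,] * V[,j];  /* prediction error               */
            Ui = U[i,];
            U[i,] = Ui    + step * (e * V[,j]` - lambda * Ui);
            V[,j] = V[,j] + step * (e * Ui`    - lambda * V[,j]);
         end;
      end;
   end;
end;
Rhat = U * V;                             /* filled-in rating matrix        */
print Rhat;
quit;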


RE METHODS CLUSTER

First cluster the users/items on their profiles and ratings; then generate the predictions with, for example, k-NN within one subgroup (cluster).


RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

$$
\text{Support}(X,Y) = \frac{\#\,\text{trxs with } X \text{ and } Y}{\text{total}\ \#\,\text{trxs}}
\qquad
\text{Lift} = \frac{\text{Support}(X,Y)}{\text{Support}(X)\,\text{Support}(Y)}
$$

Support & lift examples: Diapers → Beer 0.8; Diapers → Candles 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
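A quick worked example with made-up counts: out of 1,000 transactions, 100 contain diapers, 80 contain beer and 20 contain both. Then Support(diapers, beer) = 20/1000 = 0.02, Support(diapers) = 0.1, Support(beer) = 0.08, and Lift = 0.02 / (0.1 × 0.08) = 2.5: customers with diapers in their basket are 2.5 times more likely to also have beer.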


RE METHOD: ENSEMBLE

A linear combination of the previous methods, to achieve better performance.
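For instance (the form and weights here are illustrative, not prescribed by the procedure): $\hat r_{ui} = \sum_m \alpha_m\, \hat r_{ui}^{(m)}$, a weighted average of the individual methods' predictions with $\sum_m \alpha_m = 1$.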


PROC RECOMMEND recom = rsIENS;

   /* add a recommendation system */
   ADD rsIENS item = item user = user rating = rating;

   /* add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rsIENS type = rating vars = (item user rating);

   /* method SVD, solved with L-BFGS, 20 factors */
   METHOD svd /
      factors   = 20
      label     = "svd"
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      maxfeval  = 5000
      function  = L2
      lambda    = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm / label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rsIENS;
   PREDICT /
      method = svd
      label  = "svd"
      num    = 3
      users  = ("Longhow Lam");
   RUN;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
- Unfamiliar to a broader audience, (more) difficult to explain
- Black box approach (you are rejected: the computer says NO)
- Often the relations can already be modeled with classical regression models
- It allows you to not think about the business problem

PROS:
- Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
- Interactions are often "automatically" taken into account
- Superior for text mining, image & speech recognition
- Better lift possible (a few percent "for free")
- It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

- Many different techniques
- Easy to use GUIs combined with flexible coding
- High performance & scalability
- Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.


Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R² linear regression = 0.5, R² neural net = 0.6


IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

The first 100 digits of the MNIST data and their KNOWN labels in red.


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

Models tried on a 70/30 training/validation split:
- PCA regression on the 50 largest PCs
- Seven single layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
- Seven multi layer neural nets
- Three random forests: 100, 500 and 1000 trees
- 8, 16 and 24 nearest neighbors

8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.


MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are the predicted labels. We see some obvious mistakes…


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[audio clips of the spoken digits 1 and 2]


SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain.

Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.
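A minimal sketch of the frequency-domain step (SAS/IML assumed; the signal is a synthetic 440 Hz tone rather than a real WAV file):

proc iml;
n = 8000;                                 /* 1 second sampled at 8 kHz     */
t = (0:(n-1))` / 8000;
noise = j(n, 1);
call randseed(42);
call randgen(noise, "Normal", 0, 0.1);
x = sin(2 * constant("pi") * 440 * t) + noise;
f = fft(x);                               /* columns: cosine & sine terms  */
mag = sqrt(f[,1]##2 + f[,2]##2);          /* magnitude spectrum            */
/* stack the spectra of all recordings, then reduce them with PCA         */
print (max(mag));
quit;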


SPEECH RECOGNITION


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data.

Zero errors on the test data (also 8 'ones' and 8 'twos').


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Little joke on my colleagueshellip


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, and collect the JSON results and store them in an ABT (analytical base table).

Apply advanced analytics on the ABT:
- Which faces are look-alikes? proc cluster (hierarchical clustering)
- Sales faces? Predictive modeling / machine learning
- Who is the Brad Pitt? Nearest neighbor
- Strange faces? proc neural auto-encoder


STRANGE FACE DETECTION LOOK ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS Faces Actors Faces

Read more on my blog


  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 14: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 200 9000 9000

2 10000 150 5000 14000

3 10000 100 1000 15000

4 10000 100 1000 16000

5 10000 100 1000 17000

6 10000 100 1000 18000

7 10000 100 1000 19000

8 10000 080 -600 18400

9 10000 050 -3000 15400

10 10000 020 -5400 10000

The profit by using a model to sent letters only to the first 7 deciles is now

euro 19000 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 09 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 300 17000 17000

2 10000 200 9000 26000

3 10000 140 4200 30200

4 10000 115 2200 32400

5 10000 100 1000 33400

6 10000 060 -2200 31200

7 10000 040 -3800 27400

8 10000 030 -4600 22800

9 10000 010 -6200 16600

10 10000 005 -6600 10000

The profit by using a much better model to sent letters only to the first 5 deciles is now

euro 33400 (instead of euro 10000)

If you have 100 of such campaigns a year that means an increase of

euro 234 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING WHY IT CAN MATTER euro euro euro

Decile N Conversion Profit Cumulative1 10000 335 19800 19800

2 10000 223 10840 30640

3 10000 130 3400 34040

4 10000 110 1800 35840

5 10000 100 1000 36840

6 10000 055 -2600 34240

7 10000 028 -4760 29480

8 10000 025 -5000 24480

9 10000 005 -6600 17880

10 10000 002 -6840 11040

Now lets suppose we have even a slightly better model than the last one

euro 36840

If you have 100 of such campaigns a year that means an increase of

euro 268 mln

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines

K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

ldquoCLASSICALrdquo REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling.

X1, X2, X3, …, X500

Cluster 1: X1, X21, X35, X430, …  → use X35
Cluster 2: X17, X29, X353, X490, … → use X29
Cluster 3: X37, X95, X251, X393, … → use X251
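A sketch with PROC VARCLUS (table and variable names are illustrative); per cluster one then typically keeps the variable with the lowest 1−R² ratio:

proc varclus data=work.train maxclusters=10 short;
   var x1-x500;   /* the 500 candidate inputs */
run;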


BAGGING & BOOSTING

COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough, let multiple models vote for a prediction: Bootstrap Aggregation (Bagging).

This only makes sense if the underlying models are different enough and have some predictive power.

[Diagram: data → random samples → a model per sample → final model]

BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees.

Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Randomly choose m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.

In case of a regression tree: the random forest prediction is the average of all trees.
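A sketch using the Enterprise Miner high-performance procedure (assuming PROC HPFOREST is available; the table, target and inputs are illustrative):

proc hpforest data=work.train maxtrees=100 vars_to_try=10;
   target default / level=binary;    /* the target to predict */
   input x1-x100  / level=interval;  /* m inputs sampled per split via vars_to_try */
run;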

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

Decision tree and random forest (100 sub-trees) fitted on the simulated data.

It is clear that the forest produces much smoother predictions.

GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting: M iterations, m = 1, 2, …, M.

[Diagram: inputs x and pseudo residuals r1, r2, …, rM feed successive base learners, ending in the final model FM]

At each successive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals rim, using the inputs x, to "correct" the previous learner:

Fm = Fm-1 + γm · hm
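A sketch with PROC TREEBOOST (Enterprise Miner; names are illustrative), where shrinkage plays the role of the step size γ and iterations the role of M:

proc treeboost data=work.train iterations=100 shrinkage=0.1 maxbranch=2;
   target default / level=binary;    /* the target to predict */
   input x1-x100  / level=interval;  /* inputs for the base-learner trees */
run;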

SUPPORT VECTOR MACHINES

Support vector machines (SVM): suppose we have a separable classification problem.

Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".


SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

Separable classification

Non-separable classification

Non-separable classification rewritten using the Lagrange dual problem

Kernels to model non-linear behaviour
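For reference, the standard textbook forms behind these bullets (not copied from the slides themselves):

min_{w,b} ½‖w‖²  subject to  y_i (wᵀx_i + b) ≥ 1                                    (separable)

min_{w,b,ξ} ½‖w‖² + C Σ_i ξ_i  subject to  y_i (wᵀx_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0      (non-separable)

max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)  subject to  0 ≤ α_i ≤ C,  Σ_i α_i y_i = 0   (Lagrange dual, with kernel K)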

https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.

K – NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

[Figure: the 5 nearest neighbours of x0; 3 of them are red, 2 of them are green, so we predict x0 to be red]

K-NN METHOD

[Figures: decision boundaries for 1 nearest neighbour and 15 nearest neighbours]

K-NN METHOD

Use different numbers k of nearest neighbours and compare the test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like: handwritten digits, satellite image scenes, EKG patterns.
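A k-NN classifier can also be fit outside Enterprise Miner; a minimal sketch with PROC DISCRIM (METHOD=NPAR with K= gives nearest-neighbour classification; the tables and variables are illustrative):

proc discrim data=work.train test=work.new testout=work.pred
             method=npar k=5;
   class label;     /* the known class, e.g. the digit */
   var   x1-x10;    /* distances are computed on these inputs */
run;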

K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.


Comparing different nearest neighbours in SAS Enterprise Miner

K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were tried; k = 5 nearest neighbours has the lowest average squared error.

NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = f( w1·1 + w2·X2 + w3·X3 + w4·X4 )

[Diagram: inputs 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding a compute node with activation function f]

Neural network compute node: f is the so-called activation function. This could be the logit function, but other choices are possible.

There are four weights w's that have to be determined.

NEURAL NETWORKS: MATHEMATICAL FORMULATION

[Diagram: inputs X1 (age), X2 (income), X3 (region), X4 (gender) → hidden layer Z1, Z2, Z3 → output Y, with weights α and β]

X: inputs; Z: hidden layer; Y: output.

In formulas, the prediction formula for a NN is given by

Y = g( β0 + βᵀZ ),  where  Zm = σ( α0m + αmᵀX )

The functions g and σ are the output and activation functions; in case of a binary classifier, g is the logistic function.

The model weights α and β have to be estimated from the data.

NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back propagation algorithm:

Randomly choose small values for all wi's. For each data point (observation):

1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w according to the gradient:  w_new = w_old − λ · ∂E/∂w
4. Stop if the error E is small enough

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: X1, X2, X3, X4 → ENCODE → 2-dimensional middle layer → DECODE → X1, X2, X3, X4]

A linear activation function corresponds with 2-dimensional principal components analysis; a 2-dimensional middle layer can be used for visualisation.

NEURAL NETS: AUTOENCODERS

Often more hidden layers with many nodes.

[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]

NEURAL NET: CARS EXAMPLE

[Figures: 2-dimensional PCA vs an autoencoder network 25 – 15 – 2 – 15 – 25]

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR          */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED     */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

Two-dimensional representation of the 400-dimensional 'digit' data

BAYESIAN NETWORKS

BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data


TEXT MINING

TEXT MINING BASICS

"Advanced" word counting:

Parse & filter: part-of-speech tagging, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Apply traditional data mining: clustering, prediction, machine learning.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" (I walk down the street in Amsterdam 1057DK with my bike)
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw" (She did not walk but cycled with her blue bike)
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" (My two-wheeler is broken, what a bad piece of iron)

TERM DOCUMENT MATRIX A

Terms                      Doc 1   Doc 2   Doc 3
+Fiets (noun)                1       1       1
Fietsen (verb)               0       1       0
Blauwe (adjective)           0       1       0
Amsterdam (location)         1       0       0
+Lopen (verb)                1       1       0
Straat (noun)                1       0       0
Kapot (adverb)               0       0       1
Slecht                       0       0       1
Stuk Ijzer                   0       0       1
1057DK (postal code)         1       0       0
bitlycomsdrtw (internet)     0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A

TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:

• Often more terms than documents
• Rows could be strongly correlated
• Matrix is often very sparse

Apply singular value decomposition first.

TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition:  A = U Σ V^T

with Σ diagonal, containing the r singular values [r could be many thousands].

Take only the first k << r singular values:  A_k = U_k Σ_k V_k^T

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.

TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[Diagram: documents grouped into Topic 1, Topic 2, Topic 3]

RECOMMENDATION ENGINE: Which product should I recommend my customers?

RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items, ~0.01% filled.

User - Item Matrix - Data
           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings:                -     -     1  2  5
After some math, the predictions are: 3.21  4.82  1  2  5

Recommend item 2.

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
Model-based algorithms: matrix factorization (SVD - LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble

RE METHODS: SLOPE ONE

Item-item based: y = x + b, with slope equal to 1.

Weight wij: the number of users having rated both items i and j. Rating ruj: the average rating computed from item j.

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
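As an illustration of how slope one fills the gaps (a worked example on the sample table above, not from the slides), predict Lucy's rating for item A:

dev(A,B) = average of (rA − rB) over users who rated both = ((5−3) + (3−4)) / 2 = 0.5
dev(A,C) = (5−2) / 1 = 3

via item B: 2 + 0.5 = 2.5  (based on 2 users)
via item C: 5 + 3   = 8    (based on 1 user)

prediction = (2.5·2 + 8·1) / (2 + 1) ≈ 4.33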

RE METHODS: K NEAREST NEIGHBORS

The rating rui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Diagram: similarity w, neighbors N]

RE METHODS: PEARSON CORRELATION

a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between −1 and 1.

sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √( Σ_{p∈P} (r_{a,p} − r̄_a)² ) · √( Σ_{p∈P} (r_{b,p} − r̄_b)² ) )

RE METHODS: K NEAREST NEIGHBORS METHOD

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

R (m by n) ≈ U (m by k) · V (k by n), with users as rows and items as columns; each user i has a factor vector U_i and each item j a factor vector V_j.

Select a loss function (squared error); select the number of hidden factors k; optimization problem: L-BFGS, ALS.

Predict a new rating:  R̂_ij = U_iᵀ V_j

Minimize the prediction error:  min_{U,V} Σ_{i,j} ( R_ij − U_iᵀ V_j )² + λ ( ‖U_i‖² + ‖V_j‖² )
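A minimal SAS/IML sketch of this factorization, using plain stochastic gradient descent instead of L-BFGS / ALS (the ratings matrix and all settings are illustrative; 0 denotes a missing rating):

proc iml;
   R = {5 3 2,
        3 4 0,
        0 2 5};                        /* illustrative ratings, 0 = missing */
   k = 2; lambda = 0.2; eta = 0.05;    /* hidden factors, regularization, step size */
   call randseed(123);
   U = j(nrow(R), k, .);  call randgen(U, "Uniform");  U = U - 0.5;
   V = j(ncol(R), k, .);  call randgen(V, "Uniform");  V = V - 0.5;
   do iter = 1 to 2000;                /* gradient steps over the known cells */
      do ii = 1 to nrow(R);
         do jj = 1 to ncol(R);
            if R[ii, jj] > 0 then do;
               e = R[ii, jj] - U[ii, ] * V[jj, ]`;   /* prediction error */
               U[ii, ] = U[ii, ] + eta * (e * V[jj, ] - lambda * U[ii, ]);
               V[jj, ] = V[jj, ] + eta * (e * U[jj, ] - lambda * V[jj, ]);
            end;
         end;
      end;
   end;
   Rhat = U * V`;                      /* predicted ratings, gaps filled in */
   print Rhat;
quit;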

RE METHODS: CLUSTER

[Diagram: user/item profiles and user/item ratings → clustering → knn within one subgroup → predictions]

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc. rules mining: identify frequent itemsets (rules) in the transaction data.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X, Y) = # trxs {X, Y} / # total trxs
Lift = Support(X, Y) / ( Support(X) · Support(Y) )

Support & lift examples: Diapers → Beer: 0.8; Diapers → Candles: 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.

METHOD: ENSEMBLE

Linear combination of the previous methods, to achieve better performance.

PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors   = 20
      label     = "svd"
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      maxfeval  = 5000
      function  = L2
      lamda     = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label  = "svd"
      num    = 3
      users  = ("Longhow Lam");
RUN;
QUIT;


LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance, scalability
• Easy deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

[Plot: predicted review score vs given review score]

R² linear regression = 0.5
R² neural net = 0.6

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

[Figure: the first 100 digits of the MNIST data with their KNOWN labels in red]

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split. Techniques tried:
• PCA regression on the 50 largest PC's
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours

8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA: APPLY MODEL ON THE TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we see some obvious mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio players: spoken digits '1' and '2']

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.

SPEECH RECOGNITION

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt look-alike? Nearest neighbour
• Strange faces? proc neural autoencoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

[Figures: SAS faces vs actors' faces]

Read more on my blog.

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

[Figures: SAS faces vs actors' faces]

Read more on my blog.

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain.

Still too much: apply principal components.

TRAIN DATA

8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.
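A sketch of this pipeline in SAS; the dataset and variable names (wav_signal, amplitude, spectra_per_recording, p_1-p_500) are assumptions, and PROC SPECTRA is the SAS/ETS spectral-analysis procedure:

proc spectra data=wav_signal out=freq p;
   var amplitude;               /* periodogram of one sampled WAV signal */
run;

/* assume the periodogram ordinates have been transposed to one row per
   recording; compress them with principal components */
proc princomp data=spectra_per_recording out=pc_scores n=5;
   var p_1-p_500;
run;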

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data.

Zero errors on the test data (also 8 'ones' and 8 'twos').

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues...

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

1. Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

2. Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

3. Apply advanced analytics on the ABT:
   • Which faces are look-alikes? proc cluster (hierarchical clustering)
   • Sales faces? Predictive modeling / machine learning
   • Who is the Brad Pitt? Nearest neighbour
   • Strange faces? proc neural auto-encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: LOOK-ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: BRAD PITT LOOK-ALIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS Faces Actors Faces

Read more on my blog


Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MACHINE LEARNING: WHY IT CAN MATTER € € €

Decile      N   Conversion   Profit   Cumulative
     1  10000      3.35%      19800       19800
     2  10000      2.23%      10840       30640
     3  10000      1.30%       3400       34040
     4  10000      1.10%       1800       35840
     5  10000      1.00%       1000       36840
     6  10000      0.55%      -2600       34240
     7  10000      0.28%      -4760       29480
     8  10000      0.25%      -5000       24480
     9  10000      0.05%      -6600       17880
    10  10000      0.02%      -6840       11040

Mailing only the best five deciles gives the maximum cumulative profit of € 36,840.

Now let's suppose we have an even slightly better model than the last one.

If you have 100 of such campaigns a year, that means an increase of

€ 2.68 mln
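To make the cumulative column concrete, a minimal DATA step that simply re-adds the per-decile profits from the table above (values typed over from the slide):

data campaign;
   input decile n conversion profit;
   cum_profit + profit;      /* running total of profit per decile */
datalines;
1 10000 3.35 19800
2 10000 2.23 10840
3 10000 1.30 3400
4 10000 1.10 1800
5 10000 1.00 1000
6 10000 0.55 -2600
7 10000 0.28 -4760
8 10000 0.25 -5000
9 10000 0.05 -6600
10 10000 0.02 -6840
;
run;
/* cum_profit peaks at 36840 after decile 5: mail only the top 5 deciles */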

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression, Decision trees, Dimension reduction, Bagging & Boosting, Support vector machines,

K-Nearest Neighbour, Neural networks / deep learning, Bayesian networks, Text mining, Recommendation engines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

"CLASSICAL" REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR & LOGISTIC REGRESSION

Numeric target variable:  Income = a + b × Age
(a straight line through the Age-Income points)

Binary target variable:  P(Churn) = 1 / (1 + exp(-(a + b × Age)))
(an S-shaped curve between 0 and 1 on the Age axis)
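A minimal sketch of both fits; CHURNDATA, INCOME, AGE and CHURN are assumed names:

/* straight line: numeric target */
proc reg data=churndata;
   model income = age;
run; quit;

/* S-curve: binary target, P(churn) via the logit link */
proc logistic data=churndata descending;
   model churn = age;
run;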

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION: MODELING NON-LINEARITIES

Often there is a non-linear relation between Y (or logit(Y)) and X:
• Transformation of inputs: X², X³, log(X), etc.
• Buckets / binning of variables
• Smoothing splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION: MODELING NON-LINEARITIES

Smoothing splines: piecewise polynomials that are glued together at knots, minimizing

   Σ_i (y_i - f(x_i))² + λ ∫ f''(t)² dt

Two special cases for λ:

λ = 0: any function that interpolates the data

λ = ∞: simple least-squares line fit

Choose λ by cross-validation.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION: OPEL ASTRA CAR EXAMPLE

Extracted data from a car sales site. For many cars we have the kilometres driven and the car price. For the Opel Astra we have 2,360 cars. What is the relation between km driven and car sales price?

(plots: too much smoothing and too little smoothing)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION: OPEL ASTRA CAR EXAMPLE

0.2 is the optimal smoothing parameter.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makes/models with spline estimates of car depreciation versus kilometres driven.

Hmmm, my Renault Clio looks nice, but after 50,000 km I only have 46% of the original value left...

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION: MODELING NON-LINEARITIES IN SAS

In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit (multivariate) regression splines.

ADAPTIVEREG supports: more than one input; linear, logistic, Poisson and GLM regressions; it combines both regression splines and model selection methods, and supports partitioning of the data into training, validation and testing roles.
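A sketch of two of these procedures on the car data; OPEL_ASTRA, PRICE and KM are assumed names:

/* local regression with the 0.2 smoothing parameter found above */
proc loess data=opel_astra;
   model price = km / smooth=0.2;
run;

/* thin-plate smoothing spline; lambda chosen by generalized
   cross validation by default */
proc tpspline data=opel_astra;
   model price = (km);
run;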

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work? A simple example. Suppose we have the following group of people: 50% response, 50% no response. We have/know Age and Marital Status.

Root: 50/50

Split on Age:  Age ≤ 45 gives 30/70;  Age > 45 gives 60/40

A further split on Marital Status:  Married/Divorced gives 20/80;  Unmarried gives 60/40

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES: REGRESSION & CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

… … … … … …

Y 46 A 657 21 X

A recursive splitting algorithm:

1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split on
4. On the two new data sets, apply 1, 2, 3 again...
5. Stop somewhere...

• How to split? X1 or X2?  • When to stop? (see the PROC HPSPLIT sketch below)
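A minimal tree sketch with PROC HPSPLIT; RESPONSES, RESPONSE, AGE and MARITAL_STATUS are assumed names:

proc hpsplit data=responses maxdepth=4;
   class response marital_status;          /* nominal target and input */
   model response = age marital_status;    /* recursive splitting      */
run;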

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES: REGRESSION & CLASSIFICATION

How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.

Why is the split X1 < t1 better than X1 < s1?
• Regression: mean squared error
• Classification: misclassification rate, cross-entropy, chi-squared

(plot: regression tree fits of Y on x, comparing split s1 with split t1 by mean square error)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES: REGRESSION & CLASSIFICATION

The same idea for a classification tree, comparing splits by misclassification rate.

(plot: classification example, comparing split s1 with split t1)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES (REGRESSION & CLASSIFICATION)

When to stop? Not too early, not too late.

Pruning: remove parts of the tree.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

• CHAID (chi-squared automatic interaction detection)

• C4.5 / C5.0

• CART (Classification And Regression Trees)

The difference is mainly in the different splitting options.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES: PROS AND CONS

Pros:
• Interaction between variables
• Interpretable rules
• Missing values easy to incorporate

Cons:
• Unstable
• "Lack of smoothness"
• Fit of obvious (non)linear relations

(figures: an example tree with splits male/female, income < 45K, age < 33 and their response rates; the stepwise tree fit on the Opel Astra data)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data.

The transformation W is such that:
• the largest variance is in the first coordinate,
• the second largest variance is in the second coordinate,
• etc.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

(figure: scatter of correlated points (X1, X2) with the principal directions P1 and P2 drawn in)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

(figure: the same data expressed in the principal-component coordinates P1 and P2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The math behind:  P = X·W

With two dimensions:

   [ p11  p21 ]   [ x11  x21 ]
   [  ..   .. ] = [  ..   .. ] · [ w11  w21 ]
   [ p1n  p2n ]   [ x1n  x2n ]   [ w12  w22 ]

w11 and w12 are the loadings corresponding to the first principal component;
w21 and w22 are the loadings corresponding to the second principal component.

In general, it turns out that the columns of W are the eigenvectors of the matrix XᵀX.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here.

Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the PCs instead of the original inputs
(see the PROC PRINCOMP sketch below)
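A minimal sketch; INPUTS and X1-X100 are assumed names. PROC PRINCOMP works on the correlation matrix by default, so the inputs are scaled automatically:

proc princomp data=inputs out=scores n=2;
   var x1-x100;
run;
/* SCORES now contains Prin1 and Prin2: plot them, look for outliers,
   or use them as the inputs of a (PCA) regression */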

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X·W. Now only take the first L columns of W:  P_L = X·W_L

For example, for visualization only use the first 2 or 3 columns, so that P_L has only 2 or 3 columns that can be visualized in scatter or contour plots.

   P   (10000 × 100) = X (10000 × 100) · W   (100 × 100)
   P_L (10000 × 2)   = X (10000 × 100) · W_L (100 × 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition:  A = U Σ Vᵀ

Σ is diagonal with r singular values [r could be a large number].

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition:  A = U Σ Vᵀ, with Σ diagonal with r singular values [could be a large number].

Take only the k << r largest singular values:  A ≈ A_k = U_k Σ_k V_kᵀ

A data point d can now be represented by a k-dimensional point.
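A sketch of this in PROC IML; PICTURES is an assumed table with one image per row:

proc iml;
   use pictures;
   read all var _num_ into A;                /* rows = images, cols = pixels */
   call svd(U, Q, V, A);                     /* A = U*diag(Q)*V`             */
   k  = 15;                                  /* keep 15 singular values      */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`; /* rank-k approximation of A    */
quit;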

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original: 2448 × 3264 ≈ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD with the 15 largest SVs: ≈ 1% of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD with the 75 largest SVs: ≈ 5% of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251
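A minimal sketch with PROC VARCLUS; INPUTS and X1-X500 are assumed names:

proc varclus data=inputs maxclusters=10 short;
   var x1-x500;
run;
/* per cluster, keep the variable with the lowest 1-R**2 ratio */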

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved


BAGGING & BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough, let multiple models vote for a prediction: Bootstrap Aggregation (Bagging).

Draw random samples from the data, fit a model on each sample, and combine them into a final model.

This only makes sense if the underlying models are different enough and have some predictive power.

(diagram: data, random samples, models, final model)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees.

Apply the underlying steps repeatedly:
1. Generate a bootstrap sample.
2. Choose randomly m inputs, m << P.
3. Fit a tree on the bootstrap sample with the m inputs (do not prune).

In case of a classification tree: the random forest prediction is the majority vote of all trees.

In case of a regression tree: the random forest prediction is the average of all trees.
(see the PROC HPFOREST sketch below)
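A sketch with the high-performance forest procedure; TRAIN, RESPONSE and X1-X200 are assumed names, and VARS_TO_TRY is the m << P above:

proc hpforest data=train maxtrees=500 vars_to_try=20;
   target response / level=binary;
   input x1-x200   / level=interval;
run;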

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear that the forest produces much smoother predictions.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient boosting: M iterations, m = 1, 2, ..., M.

At each successive iteration a base learner h_m (a decision tree) is fit on the pseudo-residuals, using the inputs x, to "correct" the previous learner:

   F_m = F_(m-1) + γ·h_m

with pseudo-residuals r_1, r_2, ..., r_M recomputed at each step. The final model is F_M.
(see the PROC TREEBOOST sketch below)
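A sketch with the Enterprise Miner boosting procedure; TRAIN, RESPONSE and X1-X200 are assumed names, and the option names are quoted from memory, so treat this as a sketch rather than verified syntax:

proc treeboost data=train iterations=100 shrinkage=0.1 maxdepth=3;
   target response / level=binary;    /* gradient boosting of trees   */
   input x1-x200   / level=interval;  /* shrinkage = the gamma above  */
run;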

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM): suppose we have a separable classification problem.

Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, i.e. x², x³ or spline(x).

The beauty of SVM is that in the calculation of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".


Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

Separable classification

Non-separable classification

Non-separable classification rewritten using the Lagrange dual problem

Kernels to model non-linear behaviour
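A standard reconstruction of these four problems in textbook notation (not verbatim from the slide):

\max_{\beta,\beta_0,\ \|\beta\|=1} M \quad \text{s.t.}\quad y_i(x_i^T\beta+\beta_0)\ \ge\ M \ \ \forall i \qquad \text{(separable)}

\min_{\beta,\beta_0}\ \|\beta\| \quad \text{s.t.}\quad y_i(x_i^T\beta+\beta_0)\ \ge\ 1-\xi_i,\ \ \xi_i\ge 0,\ \ \sum_i \xi_i \le C \qquad \text{(non-separable)}

\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle \quad \text{s.t.}\quad 0\le\alpha_i\le C,\ \ \sum_i \alpha_i y_i = 0 \qquad \text{(Lagrange dual)}

\text{Kernel trick: replace } \langle x_i, x_j\rangle \text{ by } K(x_i,x_j),\ \text{e.g. } K(x,x') = \exp(-\gamma\|x-x'\|^2)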

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linearly not separable in the plane, but in 3D space they are:

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K - NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, ..., xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

(figure: the 5 nearest neighbours of x0; 3 of them are red, 2 of them are green, so we predict x0 to be red)
(see the PROC DISCRIM sketch below)
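A k-NN sketch with PROC DISCRIM's nonparametric method; TRAIN, NEWPOINTS, COLOUR, X1 and X2 are assumed names:

proc discrim data=train test=newpoints testout=pred
             method=npar k=5;       /* 5-nearest-neighbour voting */
   class colour;
   var x1 x2;
run;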

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

(decision boundaries with 1 nearest neighbour vs. 15 nearest neighbours)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours and compare test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were used: k = 5 nearest neighbours has the lowest average squared error.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS / DEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

Y = f(X, w) = f(w1·1 + w2·X2 + w3·X3 + w4·X4)

(diagram: inputs 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding one neural network compute node)

f is the so-called activation function. This could be the logit function, but other choices are possible. With f the identity, this is exactly linear regression.

There are four weights w that have to be determined.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula, the prediction formula for a NN is given by:

   Z_m = σ(α_0m + α_mᵀ X),  m = 1, ..., M

   Y = g(β_0 + βᵀ Z)

(diagram: inputs X1 = Age, X2 = Income, X3 = Region, X4 = Gender, a hidden layer Z1, Z2, Z3, and output Y / N)

The functions g and σ are sigmoid-type functions, e.g. σ(v) = 1 / (1 + e^(-v)); in case of a binary classifier, g is the logistic function as well.

The model weights α and β have to be estimated from the data.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back-propagation algorithm:

Randomly choose small values for all weights w_i. For each data point (observation):

1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual - prediction)²).
3. Adjust the weights w according to  w_new = w_old - α · ∂E/∂w.

4. Stop if the error E is small enough.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use the inputs to predict the inputs.

(diagram: inputs X1-X4, ENCODE to a small middle layer, DECODE back to outputs X1-X4)

A 2-dimensional middle layer can be used for visualisation.

With linear activation functions, the autoencoder corresponds to 2-dimensional principal components analysis.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes:

INPUT, ENCODE, DECODE, OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2-dimensional PCA vs. an autoencoder network 25 - 15 - 2 - 15 - 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, ..., x400)
• Compare two-dimensional PCA with a neural net autoencoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two-dimensional representation of the 400-dimensional 'digit' data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from training data for each node.

• Random variables are typically binary or discrete.

• The graph structure can be learned from the data.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

"Advanced" word counting.

Parse & filter: part-of-speech tagging, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Then apply traditional data mining: clustering, prediction, machine learning.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" (I walk down the street in Amsterdam 1057DK with my bike)
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw" (She did not walk but cycled on her blue bike)
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" (My two-wheeler is broken, what a bad piece of iron)

Terms                       Doc 1  Doc 2  Doc 3
+Fiets (noun)                 1      1      1
Fietsen (verb)                0      1      0
Blauwe (adjective)            0      1      0
Amsterdam (location)          1      0      0
+Lopen (verb)                 1      1      0
Straat (noun)                 1      0      0
Kapot (adverb)                0      0      1
Slecht                        0      0      1
Stuk Ijzer                    0      0      1
1057DK (postal code)          1      0      0
bitlycomsdrtw (Internet)      0      1      0

TERM DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros)

• Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.

Matrix SVD decomposition:  A = U Σ Vᵀ

Σ is diagonal with r singular values [could be many thousands].

Take only the first k << r singular values:  A ≈ A_k = U_k Σ_k V_kᵀ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topic 1, topic 2, topic 3, ...).

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE: Which product should I recommend to my customers?

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives ~0.01% filled.

User-Item Matrix - Data:
          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      3       2       5       4       5
User 2      -       -       -       1       1
User 3      1       -       2       5       -
User 4      -       -       1       2       5
User 5      2       1       4       2       3
User 6      2       3       -       5       1
User 7      5       1       -       3       4
User 8      -       1       -       4       1
User 9      2       3       2       4       2
User 10     -       1       3       -       1

User 4's item ratings:  -  -  1  2  5

After some math... the predicted ratings for user 4's missing items are 3.21 (item 1) and 4.82 (item 2).

Recommend item 2.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms:
• Slope one (slope1)
• K nearest neighbors (knn)

Model-based algorithms:
• Matrix factorization (SVD - LBFGS)

Market basket analysis:
• Association rules mining (arm)

Mixture of different methods:
• Clustering (cluster)
• Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS: SLOPE ONE

Item-item based: predict from the average rating differences between items, y = x + b, a line with slope equal to 1.

Weight w_ij: the number of users having rated both items i and j.
Rating r̄_j: the average rating computed for item j.

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
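Working the scheme through on the sample table above (standard slope-one arithmetic, reconstructed rather than taken from the slide): the average deviations are dev(A,B) = ((5-3) + (3-4)) / 2 = 0.5 and dev(A,C) = (5-2) / 1 = 3. Lucy's predicted rating for item A, weighting each deviation by the number of users who rated both items, is then

   ( w_AB · (r_Lucy,B + dev(A,B)) + w_AC · (r_Lucy,C + dev(A,C)) ) / (w_AB + w_AC)
   = ( 2 · (2 + 0.5) + 1 · (5 + 3) ) / (2 + 1) = 13 / 3 ≈ 4.33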

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

(diagram: similarity w and neighbors N around the user-item pair)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS: PEARSON CORRELATION

r_a,p: rating of user a for item p
P: the set of items rated both by a and b
Possible similarity values between -1 and 1

sim(a,b) = [ Σ_{p∈P} (r_a,p - r̄_a)(r_b,p - r̄_b) ] / [ √( Σ_{p∈P} (r_a,p - r̄_a)² ) · √( Σ_{p∈P} (r_b,p - r̄_b)² ) ]

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

Factor the m × n user-item matrix R as R ≈ U·V, with U (m × k, users × hidden factors) and V (k × n, hidden factors × items).

Predict a new rating:  R̂_ij = U_iᵀ V_j

Minimize the prediction error:  min_{U,V} Σ_ij ( R_ij - U_iᵀ V_j )² + λ( ‖U_i‖² + ‖V_j‖² )

Select a loss function (squared error), select the number of hidden factors k, and solve the optimization problem with L-BFGS or ALS.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

First cluster the users/items on their profiles or ratings; then apply kNN within one subgroup to produce the predictions.

(diagram: user/item profiles and ratings, clustering, kNN within one subgroup, predictions)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:

   IF item A and B THEN item C
   IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

   Support(X,Y) = (# trxs containing X and Y) / (total # trxs)

   Lift = Support(X,Y) / ( Support(X) · Support(Y) )

Support & lift example:  Diapers, Beer: 0.8    Diapers, Candles: 0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

A linear combination of the previous methods, to achieve better performance.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD, L-BFGS, with 20 factors */
   METHOD svd /
      factors   = 20
      label     = svd
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      maxfeval  = 5000
      function  = L2
      lamda     = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

/* prediction with the SVD method */

PROC RECOMMEND recom = rs.IENS;

   PREDICT /
      method = svd
      label  = svd
      num    = 3
      users  = ("Longhow Lam");

run;

QUIT;

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 17: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OVERVIEW OF SPECIFIC MACHINE LEARNING METHODS

Classical regression Decision trees Dimension reduction Bagging amp Boosting Support vector machines

K-Nearest Neighbour Neural networks deep learning Bayesian networks Text mining Recommendation engine

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

ldquoCLASSICALrdquo REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder


NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data = autoencoderTraining
   dmdbcat = work.autoencoderTrainingCat;
   performance compile details cpucount = 12 threads = yes;

   /* DEFAULTS: ACT = TANH, COMBINE = LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden = 5;
   hidden 300 / id = h1;
   hidden 100 / id = h2;
   hidden 2   / id = h3 act = linear;
   hidden 100 / id = h4;
   hidden 300 / id = h5;
   input corruptedPixel1 - corruptedPixel400 / id = i level = int std = std;
   target pixel1 - pixel400 / act = identity id = t level = int std = std;

   /* BEFORE PRELIMINARY TRAINING, WEIGHTS WILL BE RANDOM */
   initial random = 123;
   prelim 10 preiter = 10;
run;


Two-dimensional representation of the 400-dimensional 'digit' data


BAYESIAN NETWORKS


BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data

A tiny numeric example follows below.
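The example (plain Python): how conditional probability tables combine in a two-node network. The network Default ← LTV and all numbers are hypothetical, purely for illustration:

# conditional probability tables derived from training data (hypothetical numbers)
p_ltv = {"high": 0.3, "low": 0.7}                   # P(LTV)
p_default_given_ltv = {"high": 0.10, "low": 0.02}   # P(Default = yes | LTV)

# joint: P(LTV = high, Default = yes) = P(high) * P(yes | high)
joint = p_ltv["high"] * p_default_given_ltv["high"]

# marginal: P(Default = yes), summing over the parent node
p_default = sum(p_ltv[v] * p_default_given_ltv[v] for v in p_ltv)
print(joint, p_default)                             # 0.03 0.044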


TEXT MINING


TEXT MINING BASICS

"Advanced" word counting

Parse & Filter:
• Part of speech
• Entity detection
• Mixed / numeric / abbrev.
• Stemming
• Spell checks
• Stop list
• Synonym list
• Multi-term words

Apply traditional data mining on the result:
• Clustering
• Prediction / machine learning


TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

Terms                       Doc 1  Doc 2  Doc 3
+Fiets (znmw)                 1      1      1
Fietsen (ww)                  0      1      0
Blauwe (bvg)                  0      1      0
Amsterdam (locatie)           1      0      0
+Lopen (ww)                   1      1      0
Straat (znmw)                 1      0      0
Kapot (bijw)                  0      0      1
Slecht                        0      0      1
Stuk Ijzer                    0      0      1
1057DK (postcode)             1      0      0
bitlycomsdrtw (Internet)      0      1      0

TERM DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A
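A minimal sketch of this "advanced word counting" (plain Python; real text mining would first apply the parsing, stemming and filtering steps listed above):

from collections import Counter

docs = [
    "ik loop over straat in amsterdam met mijn fiets",
    "zij liep niet maar fietste met haar blauwe fiets",
    "mijn tweewieler is kapot wat een slecht stuk ijzer",
]
vocab = sorted({w for d in docs for w in d.split()})   # all distinct terms
counts = [Counter(d.split()) for d in docs]
# term document matrix A: one row per term, one column per document
A = [[c[term] for c in counts] for term in vocab]
for term, row in zip(vocab, A):
    print(f"{term:12s} {row}")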


TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply Singular value decomposition first


TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.

Matrix SVD decomposition:

A = U Σ V^T,  with Σ diagonal with r singular values [could be many thousands]

Take only the first k << r singular values:

A_k = U_k Σ_k V_k^T
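A sketch of this truncation (numpy; the random toy term-document matrix and k = 2 are arbitrary illustrations):

import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(0.5, size=(11, 3)).astype(float)   # toy term-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
k = 2                                              # keep only k << r singular values
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # rank-k approximation of A
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T       # one short k-dim vector per document
print(np.round(doc_vectors, 2))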


TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (Topic 1, Topic 2, Topic 3, …).


RECOMMENDATION ENGINE: Which product should I recommend to my customers?


RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items → only ~0.01% filled.

User - Item Matrix – Data

           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings:               -      -      1      2      5
After some math… the predictions:  3.21   4.82     1      2      5

Recommend item 2!


RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms:
• Slope one (slope1)
• K nearest neighbors (knn)

Model-based algorithms:
• Matrix factorization (SVD - LBFGS)

Market basket analysis:
• Association rules mining (arm)

Mixture of different methods:
• Clustering (cluster)
• Ensemble


RE METHODS SLOPE ONE

Y = x + b with slope equal to 1


Item-item based

Weight w_ij: the number of users having rated both items i and j. Rating r_uj: the average rating computed from item j.

Sample rating database:

Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4        ?
Lucy          ?        2        5
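A small sketch of weighted slope one on exactly this sample database (plain Python; a compact reading of the scheme above, not SAS code):

ratings = {
    "John": {"A": 5, "B": 3, "C": 2},
    "Mark": {"A": 3, "B": 4},
    "Lucy": {"B": 2, "C": 5},
}

def slope_one(user, target):
    num = den = 0.0
    for j, r_uj in ratings[user].items():
        both = [r for r in ratings.values() if target in r and j in r]
        if not both:
            continue
        w = len(both)                                    # users that rated both items
        dev = sum(r[target] - r[j] for r in both) / w    # average deviation target vs j
        num += w * (dev + r_uj)
        den += w
    return num / den

print(slope_one("Lucy", "A"))    # Lucy's predicted rating for item A, about 4.33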


RE METHODS K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

(Diagram: similarity w, neighbors N.)


RE METHODS

PEARSON CORRELATION

a, b : users
r_a,p : rating of user a for item p
P : set of items rated both by a and b
• Possible similarity values between −1 and 1

sim(a, b) = Σ_{p∈P} (r_a,p − r̄_a)(r_b,p − r̄_b) / [ √( Σ_{p∈P} (r_a,p − r̄_a)² ) · √( Σ_{p∈P} (r_b,p − r̄_b)² ) ]
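The same formula as a sketch (plain Python; the two toy rating dictionaries are invented):

import math

def pearson_sim(ra, rb):
    P = set(ra) & set(rb)                      # items rated by both users
    ma = sum(ra[p] for p in P) / len(P)
    mb = sum(rb[p] for p in P) / len(P)
    num = sum((ra[p] - ma) * (rb[p] - mb) for p in P)
    den = (math.sqrt(sum((ra[p] - ma) ** 2 for p in P))
           * math.sqrt(sum((rb[p] - mb) ** 2 for p in P)))
    return num / den if den else 0.0

a = {"item1": 3, "item2": 2, "item3": 5, "item4": 4}
b = {"item1": 2, "item2": 1, "item3": 4, "item4": 2}
print(pearson_sim(a, b))                       # close to 1: similar taste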


RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

R (m × n)  ≈  U (m × k) · V (k × n)     (users × items)

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict new rating:  R̂_ij = U_i^T V_j

Minimize the prediction error:

min_{U,V}  Σ_{i,j} ( R_ij − U_i^T V_j )²  +  λ ( ‖U_i‖² + ‖V_j‖² )
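A stripped-down sketch of this factorization (numpy; plain stochastic gradient descent stands in for the L-BFGS/ALS optimizers named above, and all sizes are toy choices):

import numpy as np

rng = np.random.default_rng(0)
m, n, k, lam, lr = 10, 5, 2, 0.02, 0.01
R = rng.integers(1, 6, size=(m, n)).astype(float)   # toy ratings
mask = rng.random((m, n)) < 0.5                     # which ratings are observed
U = rng.normal(scale=0.1, size=(m, k))              # user factors
V = rng.normal(scale=0.1, size=(n, k))              # item factors

for _ in range(2000):
    for i, j in zip(*np.nonzero(mask)):
        err = R[i, j] - U[i] @ V[j]                 # error on a known rating
        U[i] += lr * (err * V[j] - lam * U[i])      # gradient steps on the
        V[j] += lr * (err * U[i] - lam * V[j])      # regularized squared loss

R_hat = U @ V.T                                     # also fills the missing cells
print(np.abs(R_hat - R)[mask].mean())               # mean error on observed cells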


RE METHODS CLUSTER

Cluster the users/items first, based on the user/item profiles and ratings; then apply knn within one subgroup to generate the predictions.


RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X, Y) = #trxs{X, Y} / #total trxs

Lift(X → Y) = Support(X, Y) / ( Support(X) · Support(Y) )

Support & lift examples: Diapers → Beer: 0.8; Diapers → Candles: 0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
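A compact sketch of computing support and lift from transactions (plain Python; the five toy baskets are invented):

transactions = [
    {"diapers", "beer"},
    {"diapers", "beer", "candles"},
    {"beer", "chips"},
    {"diapers", "beer"},
    {"candles"},
]

def support(*items):
    hits = sum(1 for t in transactions if set(items) <= t)
    return hits / len(transactions)

def lift(x, y):
    # lift > 1: having x makes y more likely than it is on its own
    return support(x, y) / (support(x) * support(y))

print(support("diapers", "beer"))   # 0.6
print(lift("diapers", "beer"))      # 0.6 / (0.6 * 0.8) = 1.25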


METHOD ENSEMBLE

Linear combination of the previous methods, to achieve better performance.


PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */

PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      Num = 3
      users = ("Longhow Lam");
run;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces…

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.


Predicted review score vs. given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R² linear regression = 0.5
R² neural net = 0.6


Predicted review score vs. given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R² linear regression = 0.5
R² neural net = 0.6


IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split; models tried (see the sketch below):

• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors
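The sketch of one such comparison run, using scikit-learn in Python rather than Enterprise Miner (the file name mnist_train.csv is hypothetical; the k values are the ones listed above):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# hypothetical file: one row per image, first column the label, then 784 pixels
data = np.loadtxt("mnist_train.csv", delimiter=",", skiprows=1)
y, X = data[:, 0], data[:, 1:]

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.30, random_state=1)

for k in (8, 16, 24):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, 1 - knn.score(X_va, y_va))    # misclassification rate per k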


MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here. Red numbers are the predicted labels; we see some obvious mistakes…


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

(Embedded audio players: recordings of spoken digits 1 and 2.)


SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA

8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files
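A sketch of that preprocessing chain (Python with scipy/numpy; the wav file names are hypothetical, and mono recordings are assumed):

import numpy as np
from scipy.io import wavfile

files = [f"one_{i}.wav" for i in range(8)] + [f"two_{i}.wav" for i in range(8)]
spectra = []
for f in files:
    rate, signal = wavfile.read(f)             # ~30000 samples per recording, mono
    spectrum = np.abs(np.fft.rfft(signal))     # spectral analysis: to frequency domain
    spectra.append(spectrum[:5000])            # fixed-length slice of the spectrum
X = np.array(spectra, dtype=float)

Xc = X - X.mean(axis=0)                        # center, then principal components via SVD
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                         # first PCs become the model inputs
print(scores.shape)                            # (16, 2)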


SPEECH RECOGNITION


SPEECH RECOGNITION

Zero errors on the training data.

Zero errors on the test data (also 8 'ones' and 8 'twos').

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Little joke on my colleagues…


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• retrieve the SAS faces from our site,
• put them through the Face++ API,
• collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder


STRANGE FACE DETECTION: LOOK-ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION: STRANGE FACES

SAS Faces Actors Faces

Read more on my blog


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 18: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

ldquoCLASSICALrdquo REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 19: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LINEAR amp LOGISTIC REGRESSION

Income = a + b times Age

Age

Income

Age

P(Churn)1

0

P(Churn) =

Numeric target variable Binairy target variable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a non linear relationbull Transformation of inputs X2 X3 log(X) etchellipbull Buckets binning of variables

Y logit(y)

X

Smoothing Splines

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing Splines Piecewise polynomials that are glued together at knots

Two special cases for λ

λ = 0 Any function that interpolates the data

λ = infin Simple Least square line fit

Choose λ by cross validation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here.

Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the principal components instead of the original inputs
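In SAS the components can be computed with PROC PRINCOMP; a minimal sketch, where the data set mydata and the inputs x1-x100 are hypothetical names:

/* PROC PRINCOMP works on the correlation matrix by default, */
/* so the inputs are effectively scaled to unit variance     */
proc princomp data=mydata out=scores n=2;
   var x1-x100;    /* the original inputs              */
run;               /* scores now contains Prin1, Prin2 */

The OUT= data set can be plotted directly or used as input for a subsequent (PCA) regression.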

PRINCIPAL COMPONENTS: DIMENSION REDUCTION

P = XW. Now take only the first L columns of W:

P_L = X W_L

For example, for visualization only use the first 2 or 3 columns, so that P_L only has 2 or 3 columns that can be visualized in scatter or contour plots.

P   = X W   : (10000 by 100) = (10000 by 100) x (100 by 100)
P_L = X W_L : (10000 by 2)   = (10000 by 100) x (100 by 2)

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ Vᵀ

Σ is diagonal with r singular values [r could be a large number].

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be a large number].

Take only k << r singular values: A_k = U_k Σ_k V_kᵀ

A data point d can now be represented by a k-dimensional point.
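A short SAS/IML sketch of such a rank-k reconstruction; the matrix values are made up for illustration:

proc iml;
   a = {4 0, 3 -5, 1 2};                      /* small example matrix  */
   call svd(u, q, v, a);                      /* a = u * diag(q) * v`  */
   k = 1;                                     /* keep k largest values */
   ak = u[, 1:k] * diag(q[1:k]) * v[, 1:k]`;  /* rank-k approximation  */
   print q ak;
quit;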

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original image: 2448 x 3264 pixels, ~ 8 mln numbers

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD with the 15 largest SVs: 1% of the data

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD with the 75 largest SVs: 5% of the data

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated. Then use only one input per cluster for predictive modeling.

X1, X2, X3, ..., X500

Cluster {X1, X21, X35, X430, ...}   -> use X35
Cluster {X17, X29, X353, X490, ...} -> use X29
Cluster {X37, X95, X251, X393, ...} -> use X251

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

[Screenshots: variable clustering output in SAS]
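A minimal PROC VARCLUS sketch, where mydata and x1-x500 are hypothetical names:

proc varclus data=mydata maxclusters=10 short;
   var x1-x500;   /* groups correlated inputs into clusters       */
run;              /* per cluster, keep e.g. the variable with the */
                  /* lowest 1-R**2 ratio as its representative    */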

BAGGING & BOOSTING

COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough: let multiple models vote for a prediction.

Bootstrap Aggregation (Bagging): draw random samples from the data, fit a model on each sample, and combine them into a final model.

This only makes sense if the underlying models are different enough and have some predictive power.

[Diagram: data -> random samples -> individual models -> final model]

BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees.

Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Choose randomly m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
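These steps are what PROC HPFOREST runs under the hood; a sketch, where credit, default and x1-x10 are hypothetical names:

proc hpforest data=credit maxtrees=100 vars_to_try=3;
   target default / level=binary;   /* majority vote over all trees   */
   input x1-x10 / level=interval;   /* m = 3 inputs sampled per split */
run;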

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

Decision tree and random forest (100 sub-trees) fitted on the simulated data.

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

It is clear that the forest can produce much smoother predictions.

GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting: M iterations, m = 1, 2, ..., M.

At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo residuals, using the inputs x, to "correct" the previous learner. The pseudo residuals r_im are recomputed at each step.

Update: F_m = F_m-1 + γ·h_m

[Diagram: inputs x and residuals r1, r2, ..., rM feeding successive base learners, ending in the final model F_M]
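For completeness, the standard formula behind the pseudo residuals: for a loss function L they are the negative gradient of the loss with respect to the current model,

\[ r_{im} = -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}} , \]

which for squared-error loss reduces to the ordinary residuals y_i - F_m-1(x_i).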

SUPPORT VECTOR MACHINES


SUPPORT VECTOR MACHINES (SVM)

Suppose we have a separable classification problem. Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".


SVM: THE UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
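The formulas on this slide are the standard ones (in the notation of The Elements of Statistical Learning, recommended earlier). For the separable case:

\[ \max_{\beta,\,\beta_0,\,\|\beta\|=1} M \quad \text{subject to} \quad y_i \bigl( x_i^T \beta + \beta_0 \bigr) \ge M, \; i = 1, \dots, N. \]

In the non-separable case the constraint is relaxed to \( y_i(x_i^T\beta + \beta_0) \ge M(1 - \xi_i) \) with \( \xi_i \ge 0 \) and \( \sum_i \xi_i \le C \). In the Lagrange dual the data enter only through inner products \( \langle x_i, x_j \rangle \), which is why a kernel \( K(x_i, x_j) \) can replace them: the kernel trick.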

https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.

K - NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, ..., xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

[Figure: the 5 nearest neighbours of x0; 3 of them are red, 2 of them are green, so we predict x0 to be red]

K-NN METHOD

[Figure: decision boundaries for 1 nearest neighbour vs. 15 nearest neighbours]

K-NN METHOD

Use different numbers k of nearest neighbours and compare test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
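In SAS a k-NN classifier can be fit with PROC DISCRIM; a minimal sketch, where train, validate, y and x1-x2 are hypothetical names:

proc discrim data=train testdata=validate testout=preds
             method=npar k=5;   /* non-parametric: 5 nearest neighbours */
   class y;                     /* the target classes                   */
   var x1 x2;                   /* distances are computed on these      */
run;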

K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.

Comparing different nearest neighbours in SAS Enterprise Miner


K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were used; k=5 nearest neighbours has the lowest average squared error.

NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4

[Diagram: input nodes 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding one compute node f]

f is the so-called activation function. This could be the logit function, but other choices are possible.

There are four weights w's that have to be determined.

NEURAL NETWORKS: MATHEMATICAL FORMULATION

In formula, the prediction formula for a NN is given by the network below.

[Diagram: inputs X1 (Leeftijd/age), X2 (Inkomen/income), X3 (Regio/region), X4 (Geslacht/gender) -> hidden layer Z1, Z2, Z3 -> output Y; weights α on the input-to-hidden links, weights β on the hidden-to-output links]

In the usual formulation the functions g and σ are defined by Z_m = σ(α_0m + α_mᵀX) and Y = g(β_0 + βᵀZ), with σ the sigmoid; in case of a binary classifier, g is the logistic function.

The model weights α and β have to be estimated from the data.

NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back propagation algorithm:

Randomly choose small values for all w_i's. For each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual - prediction)²).
3. Adjust the weights w in the direction that decreases the error (a gradient descent step): w_new = w_old - λ·∂E/∂w.
4. Stop if the error E is small enough.

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS: AUTOENCODERS

https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: inputs X1 ... X4 -> ENCODE -> 2-dimensional middle layer -> DECODE -> outputs X1 ... X4]

A linear activation function corresponds with 2-dimensional principal components analysis. The 2-dimensional middle layer can be used for visualisation.

NEURAL NETS: AUTOENCODERS

https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes:

[Diagram: INPUT -> ENCODE -> DECODE -> OUTPUT = INPUT]

NEURAL NET: CARS EXAMPLE

2-dimensional PCA vs. an autoencoder network 25 - 15 - 2 - 15 - 25.

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, ..., x400)
• Compare two-dimensional PCA with a neural net autoencoder

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR               */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6    */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED          */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

Two-dimensional representation of the 400-dimensional 'digit' data.

BAYESIAN NETWORKS


BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from training data for each node.
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.

TEXT MINING


TEXT MINING BASICS

"Advanced" word counting:

Parse & filter:
• Part-of-speech tagging
• Entity detection
• Mixed / numeric / abbreviations
• Stemming
• Spell checks
• Stop list
• Synonym list
• Multi-term words

Then apply traditional data mining:
• Clustering
• Prediction / machine learning

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

(The documents are Dutch; roughly: "I walk along the street in Amsterdam 1057DK with my bike", "She did not walk but cycled on her blue bike", "My two-wheeler is broken, what a bad piece of iron".)

Terms (type)               Doc 1  Doc 2  Doc 3
+Fiets (znmw / noun)         1      1      1
Fietsen (ww / verb)          0      1      0
Blauwe (bvg / adjective)     0      1      0
Amsterdam (locatie)          1      0      0
+Lopen (ww / verb)           1      1      0
Straat (znmw / noun)         1      0      0
Kapot (bijw / adverb)        0      0      1
Slecht                       0      0      1
Stuk Ijzer                   0      0      1
1057DK (postcode)            1      0      0
bitlycomsdrtw (Internet)     0      1      0

TERM-DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros).
• Apply further mining on this matrix A.

TEXT MINING: TERM-DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term-document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply singular value decomposition first.

TEXT MINING: SVD ON THE TERM-DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be many thousands].

Take only the first k << r singular values: A_k = U_k Σ_k V_kᵀ

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.

TEXT MINING: APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f that predicts the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[Figure: word clouds for Topic 1, Topic 2, Topic 3]

RECOMMENDATION ENGINE

Which product should I recommend to my customers?

RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives ~ 0.01% filled.

User-item matrix:
          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      3       2       5       4       5
User 2      -       -       -       1       1
User 3      1       -       2       5       -
User 4      -       -       1       2       5
User 5      2       1       4       2       3
User 6      2       3       -       5       1
User 7      5       1       -       3       4
User 8      -       1       -       4       1
User 9      2       3       2       4       2
User 10     -       1       3       -       1

User 4's item ratings:            -     -     1  2  5
After some math, the estimates:   3.21  4.82  1  2  5

Recommend item 2.

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1), k nearest neighbours (knn)
Model-based algorithms: matrix factorization (SVD - LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble

RE METHODS: SLOPE ONE

Item-item based: y = x + b, with the slope equal to 1.

Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the average rating computed from item j.

Sample rating database:
Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        -       2       5
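A worked prediction from this sample table (the standard slope-one computation): the average deviation between items A and B is ((5-3) + (3-4)) / 2 = 0.5, based on w_AB = 2 users; between A and C it is (5-2) / 1 = 3, based on w_AC = 1 user. Lucy's predicted rating for item A is then the weighted average ((2 + 0.5)·2 + (5 + 3)·1) / (2 + 1) ≈ 4.33.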

RE METHODS: K NEAREST NEIGHBOURS

The rating r_ui is determined by the ratings "in the neighbourhood".

How to determine the neighbours, and how many (k) to use?

How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Diagram: similarity w, neighbours N]

RE METHODS: PEARSON CORRELATION

a, b: users; r_a,p: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between -1 and 1.

\[ sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2}\,\sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}} \]

RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

Factorize the m x n user-item matrix R into U (m x k) and V (k x n): R ≈ U·V.

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: R̂_ij = U_iᵀ V_j

Minimize the prediction error:

\[ \min_{U,V} \sum_{i,j} \bigl( R_{ij} - U_i^T V_j \bigr)^2 + \lambda \bigl( \|U_i\|^2 + \|V_j\|^2 \bigr) \]

RE METHODS: CLUSTER

First cluster the users/items, then apply k-NN within one subgroup.

[Diagram: user/item profiles and user/item ratings -> clustering -> k-NN within one subgroup -> predictions]

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

  IF item A and B THEN item C
  IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

  Support(X, Y) = (# trxs with X and Y) / (total # trxs)
  Lift(X, Y) = Support(X, Y) / (Support(X) · Support(Y))

Support examples: {Diapers, Beer}: 0.8; {Diapers, Candles}: 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
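A quick numeric check of these definitions (made-up numbers): suppose 10% of transactions contain X, 20% contain Y, and 5% contain both. Then Support(X, Y) = 0.05 and Lift = 0.05 / (0.10 · 0.20) = 2.5.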

METHOD: ENSEMBLE

A linear combination of the previous methods, to achieve better performance.

PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd
      factors = 20 label = "svd" fconv = 1e-3 gconv = 1e-3
      maxiter = 100 maxfeval = 5000 function = L2
      lamda = 0.2 technique = lbfgs;
   RUN;

   METHOD arm label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT
      method = svd label = "svd"
      num = 3
      users = ("Longhow Lam");
run;
QUIT;

LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm...)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5
R² neural net = 0.6


IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS
(MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

[Figure: the first 100 digits of the MNIST data with their KNOWN labels in red]

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split. Models tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours

8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA: APPLY THE MODEL ON THE TEST SET

28,000 digits without known labels; our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels. We obviously see some mistakes...

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio players for the spoken digits "1" and "2"]

SPEECH RECOGNITION

WAV files consist of ~ 30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.

SPEECH RECOGNITION


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues...

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, and collect the JSON results in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbour
• Strange faces? proc neural autoencoder

STRANGE FACE DETECTION LOOK ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

SAS faces vs. actors' faces. Read more on my blog.

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS faces vs. actors' faces. Read more on my blog.


  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 21: Machine learning overview (with SAS software)

SPLINE REGRESSION: MODELING NON-LINEARITIES

Smoothing splines: piecewise polynomials that are glued together at knots.

Two special cases for λ:

λ = 0: any function that interpolates the data

λ = ∞: simple least-squares line fit

Choose λ by cross validation.

OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION

Extracted data from a car sales site. For many cars we have the kilometres driven and the car price. For the Opel Astra we have 2360 cars. What is the relation between km driven and car sales price?

Too much smoothing and too little smoothing.

OPEL ASTRA CAR EXAMPLE: SPLINE REGRESSION

0.2 is the optimal smoothing parameter.

Some other car makes/models with spline estimates of car depreciation versus kilometres driven.

Hmmm, my Renault Clio looks nice, but after 50,000 km I only have 46% of the original value left…

SPLINE REGRESSION: MODELING NON-LINEARITIES

In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines (a minimal call is sketched below).

ADAPTIVEREG supports: more than one input; linear, logistic, Poisson and GLM regressions; combines both regression splines and model selection methods; supports partitioning of data into training, validation and testing roles.
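As a minimal sketch (not from the original deck): a PROC TPSPLINE fit of the car-price curve could look as follows, where the dataset astra and the variables price and km are assumed names.

proc tpspline data=astra;
   /* smoothing spline of price on km driven;             */
   /* by default the smoothing parameter is chosen by GCV */
   model price = (km);
   output out=astra_pred pred;
run;

PROC LOESS and PROC ADAPTIVEREG take similarly small MODEL statements.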


DECISION TREES

DECISION TREES

How does it work? A simple example. Suppose we have the following group of people: 50% Response, 50% No Response. We have/know Age and Marital Status.

50/50
Age ≤ 45: 30/70 | Age > 45: 60/40
Married/Divorced: 20/80 | Unmarried: 60/40

DECISION TREES: REGRESSION & CLASSIFICATION

Target | X1 | X2 | X3  | X4 | X5
Y      | 12 | A  | 456 | 12 | X
N      | 21 | B  | 456 | 15 | X
Y      | 32 | A  | 545 | 13 | U
Y      | 34 | C  | 443 | 11 | U
N      | 23 | A  | 345 | 17 | U
N      | 13 | B  | 567 | 12 | X
N      | 45 | A  | 654 | 19 | X
…      | …  | …  | …   | …  | …
Y      | 46 | A  | 657 | 21 | X

A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split
4. On the two new data sets, apply 1, 2, 3 again…
5. Stop somewhere…

• How to split? X1 or X2?
• When to stop?

(A PROC HPSPLIT sketch follows below.)
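A hedged sketch of fitting such a tree in SAS with PROC HPSPLIT; the dataset and variable names (train, response, age, marital_status) are assumptions:

proc hpsplit data=train maxdepth=5;
   class response marital_status;
   model response = age marital_status;
   grow entropy;           /* splitting criterion for step 2   */
   prune costcomplexity;   /* 'stop somewhere' via pruning     */
run;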

DECISION TREES: REGRESSION & CLASSIFICATION

How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.

Why is split X1 < t1 better than X1 < s1? Regression: mean squared error. Classification: misclassification rate, cross-entropy, chi-squared.

Regression tree: mean square error. [plot: candidate splits s1 and t1 on input x against response Y]

DECISION TREES: REGRESSION & CLASSIFICATION

How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.

Why is split X1 < t1 better than X1 < s1? Regression: mean squared error. Classification: misclassification rate, cross-entropy, chi-squared.

Classification tree: misclassification rate. [plot: candidate splits s1 and t1 on input x]

DECISION TREES (REGRESSION & CLASSIFICATION)

When to stop? Not too early, not too late.

Pruning: remove parts of the tree.

DECISION TREES: SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C4.5, C5.0

CART (Classification And Regression Trees)

The difference is mainly in the different splitting options.

DECISION TREES: PROS AND CONS

Pros: interaction between variables; interpretable rules; missing values easy to incorporate.

Cons: unstable; "lack of smoothness"; fit of obvious (non)linear relations.

[example trees: man/woman, income < 45K, age < 33 with response rates; spline fit of the Opel Astras]


DIMENSION REDUCTION

PRINCIPAL COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data.

The transformation W is such that: the largest variance is in the first coordinate; the second largest variance is in the second coordinate; etc.

PRINCIPAL COMPONENTS ANALYSIS

[scatter plot: data points in (X1, X2) with principal directions P1 and P2]

PRINCIPAL COMPONENTS ANALYSIS

[plot: the same data rotated onto the principal components P1 and P2]

PRINCIPAL COMPONENTS ANALYSIS

The math behind: P = X W. With two dimensions:

\[
\begin{bmatrix} p_{11} & p_{21} \\ \vdots & \vdots \\ p_{1n} & p_{2n} \end{bmatrix}
=
\begin{bmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{bmatrix}
\begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{bmatrix}
\]

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.

In general, it turns out that the columns of W are the eigenvectors of the matrix X^T X.

PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here.

Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use PCs instead of the original inputs

(A minimal PROC PRINCOMP call is sketched below.)
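In SAS the transformation can be computed with PROC PRINCOMP; a minimal sketch, where mydata and x1-x100 are assumed names:

proc princomp data=mydata out=scores std n=2;
   /* correlation-based PCA, so the inputs are scaled;    */
   /* OUT= contains the component scores Prin1 and Prin2  */
   var x1-x100;
run;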

PRINCIPAL COMPONENTS: DIMENSION REDUCTION

P = X W. Now take only the first L columns of W:

P_L = X W_L

For example, for visualization only use the first 2 or 3 columns, so that P_L only has 2 or 3 columns that can be visualized in scatter or contour plots.

P = X W:     (10000 × 100) = (10000 × 100) (100 × 100)
P_L = X W_L: (10000 × 2)   = (10000 × 100) (100 × 2)

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition:

A = U Σ V^T

Σ is diagonal with r singular values [r could be a large number].

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ V^T, with Σ diagonal with r singular values [r could be a large number].

Take only k << r singular values:

A_k = U_k Σ_k V_k^T

A datapoint d can now be represented by a k-dimensional point.

SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

Original: 2448 × 3264 ≈ 8 mln numbers

SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

SVD with the 15 largest SVs: ≈ 1% of the data

SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

SVD with the 75 largest SVs: ≈ 5% of the data

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated. Then use only one input per cluster for predictive modeling. (A minimal PROC VARCLUS call is sketched below.)

X1, X2, X3, …, X500

Cluster with X1, X21, X35, X430, … → use X35
Cluster with X17, X29, X353, X490, … → use X29
Cluster with X37, X95, X251, X393, … → use X251
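The clustering itself can be done with PROC VARCLUS; a minimal sketch (dataset and variable names assumed):

proc varclus data=mydata maxclusters=10 short;
   var x1-x500;
run;

/* per cluster, keep e.g. the variable with the lowest 1-R**2 ratio */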

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

[PROC VARCLUS output screenshots: cluster tree and cluster summary]

BAGGING & BOOSTING

COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough: let multiple models vote for a prediction.

Bootstrap Aggregation (Bagging): draw random samples from the data, fit a model on each sample, and combine them into a final model.

This only makes sense if the underlying models are different enough and have some predictive power.

[diagram: data → random samples → models → final model]

BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees.

Apply the underlying steps repeatedly (see the sketch after this list):
1. Generate a bootstrap sample
2. Choose randomly m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.

In case of a regression tree: the random forest prediction is the average of all trees.
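A hedged PROC HPFOREST sketch of these steps; the dataset, target and input names are assumptions:

proc hpforest data=train maxtrees=500 vars_to_try=22;
   /* each tree: bootstrap sample + m = 22 randomly chosen inputs */
   target default / level=binary;
   input x1-x500  / level=interval;
run;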

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

Decision tree and random forest (100 sub-trees) fitted on the simulated data.

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

It is clear to see that the forest produces much smoother predictions.

GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting: M iterations, m = 1, 2, …, M.

At each successive iteration a base learner hm (which is a decision tree) is fit on the pseudo-residuals rm, using inputs x, to "correct" the previous learner:

Fm = Fm-1 + γm·hm

[diagram: inputs x → pseudo-residuals r1, r2, …, rM → final model FM]
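In SAS Enterprise Miner this scheme is implemented in PROC TREEBOOST; a minimal sketch, assuming a binary target default and interval inputs (the option values are illustrative):

proc treeboost data=train iterations=200 shrinkage=0.1;
   /* 200 boosting iterations; shrinkage dampens each update Fm = Fm-1 + γ·hm */
   input x1-x10   / level=interval;
   target default / level=binary;
run;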


SUPPORT VECTOR MACHINES

SUPPORT VECTOR MACHINES (SVM)

Suppose we have a separable classification problem. Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".


SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

Separable classification

Non-separable classification

Non-separable classification rewritten using the Lagrange dual problem

Kernels to model nonlinear behaviour

(The standard formulations are written out below.)
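The formulas on this slide were images in the original; the standard formulations (in Elements of Statistical Learning notation) are:

\[
\begin{aligned}
&\text{Separable:} && \min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^2 \quad \text{s.t. } y_i(w^\top x_i + b) \ge 1 \\
&\text{Non-separable:} && \min_{w,b,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \xi_i \quad \text{s.t. } y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0 \\
&\text{Lagrange dual:} && \max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t. } 0 \le \alpha_i \le C,\ \sum_i \alpha_i y_i = 0
\end{aligned}
\]

where the kernel K replaces inner products of transformed inputs ("the kernel trick").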

https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.

K-NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

Example: the 5 nearest neighbours of x0 — 3 of them are red, 2 of them are green, so we predict x0 to be red.

K-NN METHOD

[decision boundaries: 1 nearest neighbour vs 15 nearest neighbours]

K-NN METHOD

Use different numbers k of nearest neighbours; compare test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns

K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.


Comparing different nearest neighbours in SAS Enterprise Miner

K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were used: the k=5 nearest-neighbour model has the lowest average squared error.
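The deck uses Enterprise Miner's MBR node for this; outside EM, a k-NN classifier can be sketched with PROC DISCRIM (classification only, so e.g. price bands rather than prices; all names here are assumptions):

proc discrim data=train test=nopricedata testout=pred
             method=npar k=5;
   class price_band;         /* target: discretized house price */
   var latitude longitude;   /* distance is computed over these */
run;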

NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = f(w1·1 + w2·X2 + w3·X3 + w4·X4)

[diagram: inputs 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding one compute node]

f is the so-called activation function; for linear regression it is the identity. This could also be the logit function, but other choices are possible. There are four weights w's that have to be determined.

NEURAL NETWORKS: MATHEMATICAL FORMULATION

In formula form, the prediction formula for a NN is given by a network with inputs X (e.g. age, income, region, gender), a hidden layer Z and an output Y.

[diagram: inputs X1…X4 → hidden layer Z1, Z2, Z3 (weights α) → output Y (weights β)]

The functions g and σ are the output and hidden-layer activation functions; in case of a binary classifier, Y is the class probability. The model weights α and β have to be estimated from the data. (A standard write-up of these formulas follows below.)
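The formulas for g and σ were images in the original; the standard single-hidden-layer form is:

\[
Z_m = \sigma(\alpha_{0m} + \alpha_m^\top X), \quad m = 1, \dots, M, \qquad
Y = g(\beta_0 + \beta^\top Z), \qquad
\sigma(v) = \frac{1}{1 + e^{-v}}
\]

with g the identity for regression, or a logistic/softmax function for a binary classifier.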

NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back-propagation algorithm:

Randomly choose small values for all wi's. For each data point (observation):

1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w according to the usual gradient-descent update, w ← w − η·∂E/∂w
4. Stop if the error E is small enough

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs:

[diagram: X1…X4 → ENCODE → 2-dimensional middle layer → DECODE → X1…X4]

A linear activation function corresponds with 2-dimensional principal components analysis; the 2-dimensional middle layer can be used for visualisation.

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes:

INPUT → ENCODE → DECODE → OUTPUT = INPUT

NEURAL NET: CARS EXAMPLE

2-dimensional PCA versus an autoencoder network 25 – 15 – 2 – 15 – 25.

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH COMBINE= LINEAR            */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED       */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

Two-dimensional representation of the 400-dimensional 'digit' data.


BAYESIAN NETWORKS

BAYESIAN NETWORKS — ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data


TEXT MINING

TEXT MINING BASICS

"Advanced" word counting:

Parse & filter: part of speech, entity detection, mixed/numeric/abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Apply traditional data mining: clustering, prediction, machine learning.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets."
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw."
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$."
(Dutch sample documents, roughly: "I walk along the street in Amsterdam 1057DK with my bike" / "She did not walk but cycled with her blue bike …" / "My two-wheeler is broken, what a bad piece of iron $$." — the misspelling "fieets" is deliberate, to illustrate spell checking.)

TERM-DOCUMENT MATRIX A

Terms                  | Doc 1 | Doc 2 | Doc 3
+Fiets (znmw)          |   1   |   1   |   1
Fietsen (ww)           |   0   |   1   |   0
Blauwe (bvg)           |   0   |   1   |   0
Amsterdam (locatie)    |   1   |   0   |   0
+Lopen (ww)            |   1   |   1   |   0
Straat (znmw)          |   1   |   0   |   0
Kapot (bijw)           |   0   |   0   |   1
Slecht                 |   0   |   0   |   1
Stuk Ijzer             |   0   |   0   |   1
1057DK (postcode)      |   1   |   0   |   0
bitlycomsdrtw (Internet)|  0   |   1   |   0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A

TEXT MINING: TERM-DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term-document matrix:

• Often more terms than documents
• Rows could be strongly correlated
• Matrix is often very sparse

Apply singular value decomposition first.

TEXT MINING: SVD ON THE TERM-DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ V^T, with Σ diagonal with r singular values [r could be many thousands].

Take only the first k << r singular values: A_k = U_k Σ_k V_k^T.

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.

TEXT MINING: APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.

RECOMMENDATION ENGINE: Which product should I recommend to my customers?

RECOMMENDATION ENGINE: USER–ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items → ~0.01% filled.

User–Item Matrix – Data
          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      3       2       5       4       5
User 2      -       -       -       1       1
User 3      1       -       2       5       -
User 4      -       -       1       2       5
User 5      2       1       4       2       3
User 6      2       3       -       5       1
User 7      5       1       -       3       4
User 8      -       1       -       4       1
User 9      2       3       2       4       2
User 10     -       1       3       -       1

User 4's item ratings: -, -, 1, 2, 5. After some math… the predicted ratings are 3.21, 4.82, 1, 2, 5.

Recommend item 2.

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1), k nearest neighbours (knn)

Model-based algorithms: matrix factorization (SVD – LBFGS)

Market basket analysis: association rules mining (arm)

Mixture of different methods: clustering (cluster), ensemble

RE METHODS: SLOPE ONE

Item–item based: y = x + b, i.e. a regression with slope equal to 1.

Weight wij: the number of users having rated both items i and j. Rating ruj: the average rating computed from item j. (A worked example follows below.)

Sample rating database:
Customer | Item A | Item B | Item C
John     |   5    |   3    |   2
Mark     |   3    |   4    |   -
Lucy     |   -    |   2    |   5
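A quick worked example on this table: the average difference between Item A and Item B over the users who rated both is ((5 − 3) + (3 − 4)) / 2 = 0.5, so slope one predicts Lucy's rating for Item A as her Item B rating plus this difference: 2 + 0.5 = 2.5.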

RE METHODS: K NEAREST NEIGHBOURS

The rating rui is determined by the ratings "in the neighbourhood".

How to determine the neighbours N, and how many (k) to use? How to compute the similarity/distance measure w?

• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

RE METHODS: PEARSON CORRELATION

a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between −1 and 1:

\[
sim(a, b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}
{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2}\ \sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}
\]

RE METHODS: K NEAREST NEIGHBOURS METHOD

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data? Factorize the m × n rating matrix R as R ≈ U V, with U an m × k matrix (users × hidden factors) and V a k × n matrix (hidden factors × items).

Predict a new rating as \( \hat{R}_{ij} = U_i^\top V_j \).

Minimize the prediction error:

\[
\min_{U,V} \sum_{i,j} \left( R_{ij} - U_i^\top V_j \right)^2 + \lambda \left( \lVert U_i \rVert^2 + \lVert V_j \rVert^2 \right)
\]

Steps: select a loss function (squared error); select the number of hidden factors k; solve the optimization problem with L-BFGS or ALS.

RE METHODS: CLUSTER

First cluster the users/items on their profiles and ratings; then apply k-NN within one subgroup to produce the predictions.

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X, Y) = #trxs{X and Y} / total #trxs
Lift = Support(X, Y) / (Support(X) · Support(Y))

Support & lift examples: Diapers → Beer: 0.8; Diapers → Candles: 0.018.

For example, a lift of 2.5 means: if people have X they are 2.5 times more likely to buy Y than if they don't have X. (A small numeric example follows below.)
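To make the lift formula concrete (hypothetical numbers, not from the deck): with Support(X) = 0.10, Support(Y) = 0.08 and Support(X, Y) = 0.02, Lift = 0.02 / (0.10 × 0.08) = 2.5, i.e. buyers of X are 2.5 times as likely to also buy Y as a random customer.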

RE METHOD: ENSEMBLE

Linear combination of the previous methods, to achieve better performance.

PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors   = 20
      label     = "svd"
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      MAXFEVAL  = 5000
      function  = L2
      lamda     = 0.2
      technique = lbfgs;
   RUN;

   METHOD ARM /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label  = "svd"
      Num    = 3
      users  = ("Longhow Lam");
run;
QUIT;


LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
Unfamiliar to a broader audience; (more) difficult to explain
Black-box approach (you are rejected: the computer says NO)
Often the relations can already be modeled with classical regression models
It allows you to not think about the business problem

PROS:
Often less data prep (manual tuning) necessary (just throw it into the algorithm…)
Interactions are often "automatically" taken into account
Superior for text mining, image & speech recognition
Better lift possible (a few percent "for free")

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs given review score:

R² linear regression = 0.5
R² neural net = 0.6


IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split; models tried:

PCA regression on the 50 largest PCs
Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
Seven multi-layer neural nets
Three random forests: 100, 500 and 1000 trees
8, 16 and 24 nearest neighbours

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA: APPLY MODEL ON THE TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here. Red numbers are predicted labels; we see some obvious mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[embedded audio players: spoken '1' and '2' samples]

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components. (A sketch of this step follows below.)

TRAIN DATA

8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
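A hedged sketch of the frequency-domain step with SAS/ETS PROC SPECTRA; the dataset wav and the variable amplitude are assumed names:

proc spectra data=wav out=freq p s adjmean;
   /* P_01 = periodogram, S_01 = spectral density estimate */
   var amplitude;
run;

/* after transposing to one row per recording, the periodogram  */
/* columns can be fed into PROC PRINCOMP as shown earlier       */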


SPEECH RECOGNITION

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to: retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
Which faces are look-alikes? → proc cluster (hierarchical clustering)
Sales faces? → predictive modeling / machine learning
Who is the Brad Pitt? → nearest neighbour
Strange faces? → proc neural autoencoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

SAS faces vs actors' faces.

Read more on my blog.

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS faces vs actors' faces.

Read more on my blog.

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 22: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

Extracted data from car sales site For many cars we have thekilometres driven and the car price For the Opel Astra we have 2360 cars What is the relation between km driven and car sales price

Too much smoothing and too little smoothing

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

(Dutch example documents, roughly: "I walk along the street in Amsterdam 1057DK with my bike" / "She did not walk but cycled with her blue bike bitlycomsdrtw" / "My two-wheeler is broken, what a bad piece of iron $$")

Terms                      Doc 1   Doc 2   Doc 3
+Fiets (noun)                1       1       1
Fietsen (verb)               0       1       0
Blauwe (adjective)           0       1       0
Amsterdam (location)         1       0       0
+Lopen (verb)                1       1       0
Straat (noun)                1       0       0
Kapot (adverb)               0       0       1
Slecht                       0       0       1
Stuk Ijzer                   0       0       1
1057DK (postcode)            1       0       0
bitlycomsdrtw (Internet)     0       1       0

TERM DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A

TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:
• There are often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply a singular value decomposition first.

TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition:

   A = U Σ V^T,   with Σ diagonal containing the r singular values [r could be many thousands]

Take only the first k << r singular values:

   A_k = U_k Σ_k V_k^T

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
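To make the truncation concrete, here is a minimal SAS/IML sketch; the tiny matrix and the value of k are made up for illustration, not taken from the slides:

proc iml;
/* tiny hypothetical document-term count matrix: 6 documents x 4 terms */
A = {1 1 0 0,
     0 1 1 0,
     1 0 0 1,
     0 0 1 1,
     1 1 1 0,
     0 1 0 1};
call svd(U, Q, V, A);            /* A = U * diag(Q) * V`              */
k = 2;                           /* keep only the k largest singular values */
Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;   /* rank-k approximation of A */
docCoords = U[, 1:k] # Q[1:k]`;  /* each document as a k-dimensional point */
print docCoords;
quit;

The rows of docCoords are the short document vectors on which further mining (clustering, prediction) is applied.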

TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f that predicts the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[Figure: word clouds for Topic 1, Topic 2 and Topic 3]

RECOMMENDATION ENGINE: Which product should I recommend to my customers?

RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives ~ 0.01% filled.

User - Item Matrix – Data

           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings:                      -      -      1      2      5
After some math… the estimates for User 4:  3.21   4.82   1      2      5

Recommend item 2!

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms:
• Slope one (slope1)
• K nearest neighbors (knn)

Model-based algorithms:
• Matrix factorization (SVD - LBFGS)

Market basket analysis:
• Association rules mining (arm)

Mixture of different methods:
• Clustering (cluster)
• Ensemble

RE METHODS: SLOPE ONE

Item-item based predictors of the form y = x + b, i.e. with slope equal to 1.

• Weight w_ij: the number of users having rated both items i and j
• Rating r_uj: the average rating computed from item j

Sample rating database:

Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
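As a worked illustration of the idea on the sample database above (arithmetic only, assuming the usual slope-one weighting): predict Lucy's missing rating for Item A.
• Via Item B: the average difference A − B over users who rated both is ((5−3) + (3−4)) / 2 = 0.5, so 2 + 0.5 = 2.5, with weight 2 (two users rated both A and B)
• Via Item C: the average difference A − C is (5−2) / 1 = 3, so 5 + 3 = 8, with weight 1
• Weighted prediction: (2 · 2.5 + 1 · 8) / 3 ≈ 4.33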

RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Figure: similarity w between users, neighborhood N]

RE METHODS: PEARSON CORRELATION

For users a and b, let r_{a,p} be the rating of user a for item p and let P be the set of items rated by both a and b. The similarity (possible values between −1 and +1) is

   sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √( Σ_{p∈P} (r_{a,p} − r̄_a)² ) · √( Σ_{p∈P} (r_{b,p} − r̄_b)² ) )
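A minimal SAS/IML sketch of this similarity for two users; the ratings are hypothetical and . marks an item that was not rated:

proc iml;
a = {5, 3, ., 4, 2};                      /* ratings of user a on items 1-5 */
b = {4, ., 2, 5, 1};                      /* ratings of user b              */
p = loc(a ^= . & b ^= .);                 /* items rated by both users      */
ra = a[p];  rb = b[p];
num = sum( (ra - mean(ra)) # (rb - mean(rb)) );
den = sqrt(ssq(ra - mean(ra))) * sqrt(ssq(rb - mean(rb)));
sim = num / den;                          /* Pearson similarity of a and b  */
print sim;
quit;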

RE METHODS: K NEAREST NEIGHBORS METHOD

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

Factorize the m × n user-item matrix R as R ≈ U Vᵀ, with U an m × k matrix (users) and V an n × k matrix (items):

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS or ALS

Predict a new rating:  R̂_ij = U_iᵀ V_j

Minimize the prediction error:  min_{U,V} Σ_{i,j} ( R_ij − U_iᵀ V_j )² + λ ( ‖U_i‖² + ‖V_j‖² )
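The sketch below shows the idea with plain stochastic gradient descent in SAS/IML (PROC RECOMMEND itself uses L-BFGS or ALS); the tiny rating matrix and all settings are made up for illustration:

proc iml;
R = {3 2 5,
     . . 1,
     1 . 2};                       /* tiny user x item matrix, . = unknown */
m = nrow(R);  n = ncol(R);  k = 2; /* k hidden factors                     */
call randseed(123);
U = randfun(m || k, "Uniform") - 0.5;
V = randfun(n || k, "Uniform") - 0.5;
eta = 0.05;  lambda = 0.2;         /* step size and regularization         */
do iter = 1 to 500;
   do i = 1 to m;
      do j = 1 to n;
         if R[i, j] ^= . then do;
            e  = R[i, j] - U[i, ] * V[j, ]`;   /* prediction error         */
            Ui = U[i, ];
            U[i, ] = U[i, ] + eta * (e * V[j, ] - lambda * U[i, ]);
            V[j, ] = V[j, ] + eta * (e * Ui     - lambda * V[j, ]);
         end;
      end;
   end;
end;
print (U * V`)[label = "Completed rating matrix"];
quit;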

RE METHODS: CLUSTER

First cluster the users / items on their profiles or ratings, then apply knn within one subgroup.

[Figure: user/item profiles and ratings → clustering → predictions]

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining:
• Identify frequent itemsets (rules) in the transaction data:
     IF item A and B THEN item C
     IF item X THEN item Y
• Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule

   Support(X,Y) = (# transactions with X and Y) / (total # transactions)

   Lift = Support(X,Y) / ( Support(X) · Support(Y) )

Support & lift examples: {Diapers, Beer} 0.8, {Diapers, Candles} 0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times as likely to buy Y as when they don't have X.
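A quick worked example with made-up numbers: if 2% of all transactions contain both X and Y, 10% contain X and 8% contain Y, then Support(X,Y) = 0.02 and Lift = 0.02 / (0.10 · 0.08) = 2.5.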

METHOD: ENSEMBLE

A linear combination of the previous methods, to achieve better performance.
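As an illustration (the weights are hypothetical), an ensemble prediction could be a weighted sum of the individual recommenders,

   r̂(u,i) = 0.6 · r̂_knn(u,i) + 0.4 · r̂_svd(u,i),

with the weights tuned on validation data.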

PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD / LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      Num = 3
      users = ("Longhow Lam");
run;
QUIT;

LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it into the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
• R² linear regression = 0.5
• R² neural net = 0.6


IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS
(MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data with their KNOWN labels in red.

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split, with:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified. (A fitting sketch for this model follows below.)
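For example, a k-nearest-neighbour classifier can also be fit outside Enterprise Miner with nonparametric discriminant analysis; a hedged sketch, where the dataset and variable names are assumptions:

proc discrim data = mnistTrain test = mnistValid testout = mnistPred
             method = npar k = 8;     /* 8-nearest-neighbour classification */
   class label;                       /* the known digit 0-9                */
   var pixel1 - pixel784;             /* the 28 x 28 = 784 pixel inputs     */
run;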

MNIST DATA: APPLY THE MODEL ON THE TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we see some obvious mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH AN IPHONE

[Audio players: recordings of the spoken digits "1" and "2"]

SPEECH RECOGNITION

WAV files consist of ~ 30,000 points: too much redundancy. So:
• Use spectral analysis to convert the signal to the frequency domain
• Still too much: apply principal components

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files

(A sketch of this pipeline follows below.)
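A sketch of the two preprocessing steps in SAS (SAS/ETS and SAS/STAT); the dataset and variable names are assumptions:

/* periodogram of one WAV signal: amplitude per time point */
proc spectra data = wavOne out = freqOne p;
   var amplitude;
run;

/* after stacking the periodograms of all recordings into one table,
   compress them with principal components */
proc princomp data = allSpectra out = pcScores n = 10;
   var freq1 - freq200;
run;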


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.
• Zero errors on the training data
• Zero errors on the test data (also 8 'ones' and 8 'twos')

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• Retrieve the SAS faces from our site
• Put them through the Face++ API
• Collect the JSON results and store them in an ABT

Apply advanced analytics on the ABT:
• Which faces are look-alikes? → proc cluster (hierarchical clustering); see the sketch below
• Sales faces? → predictive modeling / machine learning
• Who is the Brad Pitt? → nearest neighbor
• Strange faces? → proc neural auto-encoder
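For the look-alike question, a hierarchical clustering sketch on the landmark ABT; the names are assumptions, with the 83 landmarks stored as x/y coordinate columns:

proc cluster data = faceAbt method = ward outtree = faceTree;
   var x1 - x83 y1 - y83;      /* the Face++ landmark coordinates */
   id face;
run;

proc tree data = faceTree nclusters = 10 out = faceClusters noprint;
   id face;
run;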

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

[Figure panels: SAS faces, actors' faces]

Read more on my blog.


  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regression & classification)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 23: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OPEL ASTRA CAR EXAMPLESPLINE REGRESSION

02 is the optimal smoothing paramter

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 24: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Some other car makemodels with spline estimates of car depreciation versus kilometres driven

Hmmm my Renault Clio looks nice but after 50000 km I only have 46 of the original value lefthellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MODELING NON LINEARITIES

In SAS we have TPSLINE LOESS and the ADAPTIVEREG procedureto fit multivariate regression splines

Supports More than one input linear logistic Poisson GLM regressions combines both regression splines and model selection methods supports partitioning of data into training validation and testing roles

SPLINE REGRESSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo


SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

Separable classification

Non-separable classification

Non-separable classification rewritten using the Lagrange dual problem

Kernels to model non-linear behaviour
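The formulas on this slide were images and did not survive extraction; a hedged reconstruction of the standard formulations (notation as in The Elements of Statistical Learning) is:

\[
\text{Separable:}\quad \min_{\beta,\beta_0} \tfrac12\lVert\beta\rVert^2
\ \text{ s.t. } y_i\,(x_i^T\beta+\beta_0)\ge 1
\]
\[
\text{Non-separable:}\quad \min_{\beta,\beta_0} \tfrac12\lVert\beta\rVert^2 + C\sum_i\xi_i
\ \text{ s.t. } y_i\,(x_i^T\beta+\beta_0)\ge 1-\xi_i,\ \xi_i\ge 0
\]
\[
\text{Dual:}\quad \max_{\alpha}\ \sum_i\alpha_i-\tfrac12\sum_{i,j}\alpha_i\alpha_j y_i y_j\,\langle x_i,x_j\rangle
\ \text{ s.t. } 0\le\alpha_i\le C,\ \sum_i\alpha_i y_i=0
\]

The kernel trick replaces the inner product \(\langle x_i,x_j\rangle\) by a kernel \(K(x_i,x_j)\), e.g. polynomial \((1+\langle x,x'\rangle)^d\) or radial basis \(\exp(-\gamma\lVert x-x'\rVert^2)\).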


httpswwwyoutubecomwatchv=3liCbRZPrZA

Linearly not separable, but in 3D space they are.

K – NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

Figure: the 5 nearest neighbours of x0; 3 of them are red, 2 of them are green, so we predict x0 to be red.
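In SAS, a k-NN classifier of this kind can be requested through the nonparametric method of PROC DISCRIM; a minimal sketch, in which the data set and variable names are assumptions for illustration:

proc discrim data=train test=query testout=pred
             method=npar k=5;   /* k-NN with 5 neighbours */
   class colour;                /* red / green target     */
   var x1 x2;                   /* the two coordinates    */
run;

The TESTOUT= data set then contains the majority-vote classification for each query point.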


K-NN METHOD

1 nearest neighbour vs 15 nearest neighbours


K-NN METHOD

Use different numbers k of nearest neighbours: test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns


K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.


Comparing different nearest neighbours in SAS Enterprise Miner


K-NN EXAMPLE DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were tried; k = 5 nearest neighbours has the lowest average squared error.


NEURAL NETWORKS / DEEP LEARNING


NEURAL NETWORK: LINEAR REGRESSION

Y = f(X,w) = w1 + w2·X2 + w3·X3 + w4·X4

(diagram: nodes 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding one compute node)

Neural network compute node: f is the so-called activation function. This could be the logit function, but other choices are possible.

There are four weights w's that have to be determined.


NEURAL NETWORKS: MATHEMATICAL FORMULATION

In formula form, the prediction of a NN is given by the network below.

(diagram: inputs Age (Leeftijd), Income (Inkomen), Region (Regio), Gender (Geslacht) as X1, X2, X3, X4; hidden layer Z1, Z2, Z3; output Y; weights α1, … and β1, …)

The functions g and σ are defined below, here for the case of a binary classifier. The model weights α and β have to be estimated from the data.
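The formula images were lost in extraction; a hedged reconstruction in the single-hidden-layer notation of Hastie et al. is:

\[
Z_j=\sigma\bigl(\alpha_{0j}+\alpha_j^T X\bigr),\qquad
\sigma(v)=\frac{1}{1+e^{-v}}
\]
\[
Y=g\bigl(\beta_0+\beta^T Z\bigr),\qquad
\text{binary classifier: } g(t)=\frac{1}{1+e^{-t}}=P(Y=1\mid X)
\]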


NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back-propagation algorithm:

Randomly choose small values for all wi's. Then, for each data point (observation):

1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w according to the update rule below
4. Stop if the error E is small enough
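The update rule itself was an image; the standard gradient-descent step (the learning rate η is an assumed symbol) is:

\[
w^{\,new} = w^{\,old} - \eta\,\frac{\partial E}{\partial w}
\]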


DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS


NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

(diagram: inputs X1, X2, X3, X4 → ENCODE → middle layer → DECODE → outputs X1, X2, X3, X4)

A linear activation function corresponds with 2-dimensional principal components analysis.

2-dimensional middle layer, for visualisation.


NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

(diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT)


NEURAL NET CARS EXAMPLE

2-dimensional PCA vs autoencoder network 25 – 15 – 2 – 15 – 25


NEURAL NETS AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder


NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR            */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED       */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING, WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;


Two-dimensional representation of the 400-dimensional 'digit' data

BAYESIAN NETWORKS

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data


TEXT MINING


TEXT MINING BASICS

"Advanced" word counting

Parse & filter:
• Part of speech
• Entity detection
• Mixed / numeric / abbreviations
• Stemming
• Spell checks
• Stop list
• Synonym list
• Multi-term words

Then apply traditional data mining: clustering, prediction, machine learning.


TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" ("I walk down the street in Amsterdam 1057DK with my bike")
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw" ("She did not walk but cycled with her blue bike", plus a link)
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" ("My two-wheeler is broken, what a bad piece of iron $$")

TERM DOCUMENT MATRIX A

Terms                        Doc 1  Doc 2  Doc 3
+Fiets (znmw)                  1      1      1
Fietsen (ww)                   0      1      0
Blauwe (bvg)                   0      1      0
Amsterdam (locatie)            1      0      0
+Lopen (ww)                    1      1      0
Straat (znmw)                  1      0      0
Kapot (bijw)                   0      0      1
Slecht                         0      0      1
Stuk Ijzer                     0      0      1
1057DK (postcode)              1      0      0
bitlycomsdrtw (Internet)       0      1      0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A


TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:

• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply singular value decomposition first.


TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.

Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be many thousands].

Take only the first k << r singular values: Ak = Uk Σk Vkᵀ.


TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

Topic 1 Topic 2 Topic 3

RECOMMENDATION ENGINE: Which product should I recommend my customers?

RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items → ~0.01% filled.

User-Item Matrix - Data
          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      3       2       5       4       5
User 2      -       -       -       1       1
User 3      1       -       2       5       -
User 4      -       -       1       2       5
User 5      2       1       4       2       3
User 6      2       3       -       5       1
User 7      5       1       -       3       4
User 8      -       1       -       4       1
User 9      2       3       2       4       2
User 10     -       1       3       -       1

User 4's item ratings:  - , - , 1, 2, 5
After some math… the predicted ratings for User 4 are:  3.21, 4.82, 1, 2, 5

→ Recommend item 2


RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms:
• Slope one (slope1)
• K nearest neighbors (knn)

Model-based algorithms:
• Matrix factorization (SVD - LBFGS)

Market basket analysis:
• Association rules mining (arm)

Mixture of different methods:
• Clustering (cluster)
• Ensemble


RE METHODS SLOPE ONE

y = x + b, with slope equal to 1.

Item-item based.

Weight wij: the number of users having rated both items i and j. Rating ruj: the average rating computed from item j.

Sample rating database:
Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        -       2       5
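A worked example of how weighted slope one would fill in Lucy's missing rating for Item A from this table (the arithmetic is ours, added for illustration):

\[
\mathrm{dev}(A,B)=\frac{(5-3)+(3-4)}{2}=0.5,\qquad
\mathrm{dev}(A,C)=\frac{5-2}{1}=3
\]
\[
\hat r_{Lucy,A}=\frac{(2+0.5)\cdot 2+(5+3)\cdot 1}{2+1}\approx 4.33
\]

The weights 2 and 1 are the numbers of users who rated both items (A,B) and (A,C) respectively.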


RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

(figure: similarity w, neighbors N)


RE METHODS: PEARSON CORRELATION

a, b: users; r(a,p): the rating of user a for item p; P: the set of items rated both by a and b. Possible similarity values between -1 and 1.

\[
\mathrm{sim}(a,b)=
\frac{\sum_{p\in P}\bigl(r_{a,p}-\bar r_a\bigr)\bigl(r_{b,p}-\bar r_b\bigr)}
{\sqrt{\sum_{p\in P}\bigl(r_{a,p}-\bar r_a\bigr)^2}\;
 \sqrt{\sum_{p\in P}\bigl(r_{b,p}-\bar r_b\bigr)^2}}
\]

RE METHODS K NEAREST NEIGHBORS METHOD

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

R (m × n, users × items)  ≈  U (m × k) · V (k × n)

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS / ALS

Predict a new rating:

\[
\hat R_{ij}=U_i^T V_j
\]

Minimize the prediction error:

\[
\min_{U,V}\ \sum_{i,j}\bigl(R_{ij}-U_i^T V_j\bigr)^2
+\lambda\bigl(\lVert U_i\rVert^2+\lVert V_j\rVert^2\bigr)
\]


RE METHODS CLUSTER

k-NN within one subgroup

(diagram: user/item profiles → clustering → per-cluster user/item ratings → predictions)


RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X,Y) = (# trxs with X and Y) / (total # trxs)

Lift = Support(X,Y) / ( Support(X) · Support(Y) )

Support examples: Diapers → Beer: 0.8; Diapers → Candles: 0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.


METHOD ENSEMBLE

A linear combination of the previous methods, to achieve better performance.


PROC RECOMMEND recom = rsIENS;

   /* Add a recommendation system */
   ADD rsIENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rsIENS type = rating vars = (item user rating);

   /* Method SVD, L-BFGS, with 20 factors */
   METHOD svd /
      factors = 20 label = "svd" fconv = 1e-3 gconv = 1e-3
      maxiter = 100 MAXFEVAL = 5000 function = L2
      lambda = 0.2 technique = lbfgs;
   RUN;

   METHOD arm / label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rsIENS;
   PREDICT / method = svd label = "svd" num = 3 users = ("Longhow Lam");
run;
QUIT;

LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces…

So: can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.


USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs given review score:

R² linear regression = 0.5
R² neural net = 0.6


IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS: MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split; models tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours


MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels. We see some obvious mistakes…


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

(embedded audio players: spoken digits 1 and 2)

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
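A minimal sketch of this pipeline in SAS (PROC SPECTRA from SAS/ETS for the periodogram, PROC PRINCOMP for the reduction); the data set and variable names are assumptions for illustration:

/* frequency domain: periodogram of one sampled wav signal */
proc spectra data=wav1 out=freq1 p;
   var signal;
run;

/* one row per wav file, spectrum values as columns (hypothetical names f1-f100), */
/* then reduce to a few principal components                                      */
proc princomp data=spectra_by_file out=pcs n=5;
   var f1-f100;
run;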


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data; zero errors on the test data (also 8 'ones' and 8 'twos').


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to: retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbour
• Strange faces? proc neural auto-encoder


STRANGE FACE DETECTION LOOK ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS Faces Actors Faces

Read more on my blog



Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

DECISION TREES (REGRESSION & CLASSIFICATION)

When to stop? Not too early, not too late.

Pruning: remove parts of the tree.

DECISION TREES: SOME COMMON TYPES

• CHAID (chi-squared automatic interaction detection)
• C4.5, C5.0
• CART (Classification And Regression Trees)

The differences are mainly in the available splitting options.

DECISION TREES: PROS AND CONS

Pros:
• Interaction between variables
• Interpretable rules
• Missing values easy to incorporate

Cons:
• Unstable
• "Lack of smoothness"
• Fit of obvious (non)linear relations

[Example tree: split on gender (male/female), then income < 45K and age < 33, predicting the response rate for Opel Astras.]

DIMENSION REDUCTION

PRINCIPAL COMPONENTS ANALYSIS

Linear transformation of the data to uncorrelated data.

The transformation W is such that:
• The largest variance is in the first coordinate
• The second largest variance is in the second coordinate
• Etc...

[Figure: scatter of data points in the (X1, X2) plane with the two principal directions P1 and P2 drawn through the cloud.]

[Figure: the same data after rotation onto the principal components P1 and P2.]

PRINCIPAL COMPONENTS ANALYSIS: THE MATH BEHIND IT

With two dimensions, P = XW is

\[
\begin{bmatrix} p_{11} & p_{21}\\ \vdots & \vdots\\ p_{1n} & p_{2n} \end{bmatrix}
=
\begin{bmatrix} x_{11} & x_{21}\\ \vdots & \vdots\\ x_{1n} & x_{2n} \end{bmatrix}
\begin{bmatrix} w_{11} & w_{21}\\ w_{12} & w_{22} \end{bmatrix}
\]

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.

In general, it turns out that the columns of W are the eigenvectors of the matrix X^T X.

PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here.

Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the principal components instead of the original inputs
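A minimal sketch in SAS, assuming a hypothetical data set work.abt with inputs x1-x100:

proc princomp data=work.abt out=work.scores n=2;
   var x1-x100;   /* PRINCOMP uses the correlation matrix by default, so the inputs are scaled */
run;

work.scores then holds the first two principal component scores, ready for scatter plots or as inputs to a PCA regression.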

PRINCIPAL COMPONENTS: DIMENSION REDUCTION

P = XW. Now take only the first L columns of W:

   P_L = X W_L

For example, for visualization use only the first 2 or 3 columns, so that P_L has only 2 or 3 columns that can be visualized in scatter or contour plots.

   P   = X W    :  (10000 by 100) = (10000 by 100) (100 by 100)
   P_L = X W_L  :  (10000 by 2)   = (10000 by 100) (100 by 2)

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition:

   A = U Σ V^T

with Σ diagonal, containing the r singular values [r could be a large number].

SINGULAR VALUE DECOMPOSITION

Take only the k << r largest singular values:

   A_k = U_k Σ_k V_k^T

A data point d can now be represented by a k-dimensional point.
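As a sketch, a truncated SVD can be computed in SAS/IML; the data set work.photo holding the pixel matrix is a hypothetical stand-in:

proc iml;
   use work.photo;
   read all var _NUM_ into A;                  /* the full matrix */
   close work.photo;
   call svd(U, Q, V, A);                       /* A = U*diag(Q)*V` */
   k = 15;                                     /* keep only the k largest singular values */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;   /* rank-k approximation of A */
quit;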

SVD EXAMPLE, USING MY SON AS AN EXPERIMENT

Original: 2448 x 3264 ~ 8 mln numbers.

SVD with the 15 largest SVs: 1% of the data.

SVD with the 75 largest SVs: 5% of the data.

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated. Then use only one input per cluster for predictive modeling.

   Inputs:    X1, X2, X3, ..., X500
   Cluster 1: X1, X21, X35, X430, ...   ->  use X35
   Cluster 2: X17, X29, X353, X490, ... ->  use X29
   Cluster 3: X37, X95, X251, X393, ... ->  use X251
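A minimal sketch with PROC VARCLUS (data set and inputs hypothetical):

proc varclus data=work.abt maxclusters=10 short;
   var x1-x500;   /* divide the 500 inputs into at most 10 disjoint clusters */
run;

Within each cluster, the variable with the lowest 1-R**2 ratio in the output is a natural representative to keep.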

[Screenshot slides: variable clustering output in SAS.]

BAGGING & BOOSTING

COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough, let multiple models vote for a prediction: bootstrap aggregation (bagging).

This only makes sense if the underlying models are different enough and have some predictive power.

[Diagram: random samples drawn from the data, a model fitted on each sample, and their votes combined into a final model.]

BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees.

Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Randomly choose m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
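A hedged sketch with the high-performance forest procedure from SAS Enterprise Miner (data set and variables hypothetical):

proc hpforest data=work.credit maxtrees=100 vars_to_try=10;
   target default / level=binary;            /* majority vote over the trees */
   input age income ltv / level=interval;    /* m = 10 inputs tried at each split */
run;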

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

Decision tree and random forest (100 sub-trees) fitted on the simulated data.

It is clear that the forest produces much smoother predictions.

GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting runs M iterations, m = 1, 2, ..., M.

At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals r_m, using inputs x, to "correct" the previous learner:

   F_m = F_(m-1) + γ·h_m

[Diagram: inputs x and pseudo-residuals r_1, r_2, ..., r_M feeding the successive learners, ending in the final model F_M.]
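A hedged sketch with PROC TREEBOOST from SAS Enterprise Miner (data set, variables and parameter values are illustrative assumptions):

proc treeboost data=work.credit iterations=200 shrinkage=0.1 maxdepth=3;
   target default / level=binary;           /* M = 200 boosting iterations */
   input age income ltv / level=interval;   /* shrinkage acts as the step size γ */
run;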

SUPPORT VECTOR MACHINES

SUPPORT VECTOR MACHINES (SVM)

Suppose we have a separable classification problem. Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of SVM is that in the calculation of the decision boundary we do not need to use these transformations explicitly: "the kernel trick".

SVM: THE UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
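In the standard textbook notation (cf. Hastie, Tibshirani & Friedman), these four problems read:

\begin{align*}
&\text{Separable:} && \max_{\beta,\,\beta_0,\,\|\beta\|=1} M \quad \text{s.t. } y_i(x_i^T\beta+\beta_0)\ge M,\; i=1,\dots,N\\
&\text{Non-separable:} && y_i(x_i^T\beta+\beta_0)\ge M(1-\xi_i),\quad \xi_i\ge 0,\quad \textstyle\sum_i \xi_i \le C\\
&\text{Lagrange dual:} && \max_{\alpha}\; \textstyle\sum_i \alpha_i - \tfrac12 \sum_i\sum_j \alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle\\
&\text{Kernel trick:} && \langle x_i, x_j\rangle \;\to\; K(x_i,x_j), \quad \text{e.g. } K(x,x') = \exp(-\gamma\|x-x'\|^2)
\end{align*}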

https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable in 2D, but in 3D space they are.

K – NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, ..., xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

[Figure: the 5 nearest neighbours of x0 — 3 of them are red, 2 of them are green, so we predict x0 to be red.]
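A minimal k-NN scoring sketch in SAS (train and score data sets are hypothetical):

proc discrim data=work.train test=work.new testout=work.pred
             method=npar k=5;    /* non-parametric discriminant analysis = 5 nearest neighbours */
   class response;               /* the label the neighbours vote on */
   var x1-x10;                   /* distances are computed on these inputs */
run;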

K-NN METHOD

[Figure: decision boundaries for 1 nearest neighbour vs 15 nearest neighbours.]

K-NN METHOD

Use different numbers k of nearest neighbours and compare test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns

K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.

Comparing different numbers of nearest neighbours in SAS Enterprise Miner.

K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were used: k = 5 nearest neighbours has the lowest average squared error.

NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

   Y = f(X, w) = w1·1 + w2·X2 + w3·X3 + w4·X4

[Diagram: inputs 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding one compute node f.]

f is the so-called activation function. This could be the logit function, but other choices are possible.

There are four weights w's that have to be determined.

NEURAL NETWORKS: MATHEMATICAL FORMULATION

In formulas, the prediction of a neural net with one hidden layer is given by the standard single-hidden-layer form

   Z_m = g(α_0m + α_m^T X),  m = 1, ..., M
   Y   = σ(β_0 + β^T Z)

[Diagram: inputs X1 (age), X2 (income), X3 (region), X4 (gender), hidden layer Z1, Z2, Z3, output Y.]

The function g is the activation function; in case of a binary classifier, σ is the logistic function.

The model weights α and β have to be estimated from the data.

NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back-propagation algorithm:

Randomly choose small values for all weights w_i. Then, for each data point (observation):

1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual – prediction)²).
3. Adjust the weights w with the usual gradient-descent step: w <- w − η·∂E/∂w.
4. Stop if the error E is small enough.

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: inputs X1...X4 encoded through a 2-dimensional middle layer and decoded back to X1...X4.]

A linear activation function corresponds with 2-dimensional principal components analysis. The 2-dimensional middle layer can be used for visualisation.

NEURAL NETS: AUTOENCODERS

Often there are more hidden layers with many nodes.

[Diagram: INPUT -> ENCODE -> DECODE -> OUTPUT = INPUT]

NEURAL NET: CARS EXAMPLE

2-dimensional PCA vs an autoencoder network 25 – 15 – 2 – 15 – 25.

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, ..., x400)
• Compare two-dimensional PCA with a neural net autoencoder

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR          */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED      */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

Two-dimensional representation of the 400-dimensional 'digit' data.

BAYESIAN NETWORKS

BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from the training data for each node.
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.

TEXT MINING

TEXT MINING BASICS

"Advanced" word counting.

Parse & filter:
• Part-of-speech tagging, entity detection
• Mixed numerics, abbreviations
• Stemming, spell checks
• Stop list, synonym list, multi-term words

Then apply traditional data mining / machine learning: clustering, prediction.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam, 1057DK, met mijn fiets."
Document 2: "Zij liep niet maar fietste met haar blauwe fieets, bitlycomsdrtw."
Document 3: "Mijn tweewieler is kapot, wat een slecht stuk ijzer, $$."

TERM-DOCUMENT MATRIX A

Terms                    Doc 1  Doc 2  Doc 3
+Fiets (znmw)              1      1      1
Fietsen (ww)               0      1      0
Blauwe (bvg)               0      1      0
Amsterdam (locatie)        1      0      0
+Lopen (ww)                1      1      0
Straat (znmw)              1      0      0
Kapot (bijw)               0      0      1
Slecht                     0      0      1
Stuk Ijzer                 0      0      1
1057DK (postcode)          1      0      0
bitlycomsdrtw (Internet)   0      1      0

• Each text document is a (very) long vector of word counts (often with many zeros).
• Apply further mining on this matrix A.

TEXT MINING: TERM-DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term-document matrix:
• often more terms than documents
• rows could be strongly correlated
• the matrix is often very sparse

Apply a singular value decomposition first.

TEXT MINING: SVD ON THE TERM-DOCUMENT MATRIX A

Matrix SVD decomposition:

   A = U Σ V^T

with Σ diagonal, containing the r singular values [r could be many thousands]. Take only the first k << r singular values:

   A_k = U_k Σ_k V_k^T

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.

RECOMMENDATION ENGINE: Which product should I recommend to my customers?

RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives ~0.01% filled.

User-Item Matrix – Data
           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings: -, -, 1, 2, 5. After some math... the predicted ratings for User 4 are 3.21, 4.82, 1, 2, 5.

Recommend item 2!

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: slope one (slope1), k nearest neighbours (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble

RE METHODS: SLOPE ONE

Item-item based: y = x + b, with slope equal to 1.

Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the average rating computed from item j.

Sample rating database:
Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        -       2       5
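As a worked illustration of slope one on this table: predicting Lucy's rating for Item A uses the average difference between A and B, ((5−3) + (3−4))/2 = 0.5, giving 2 + 0.5 = 2.5 with weight 2 (two users rated both A and B), and the difference between A and C, 5−2 = 3, giving 5 + 3 = 8 with weight 1. The weighted prediction is (2·2.5 + 1·8)/3 ≈ 4.33.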

RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbours, and how many (k) to use?

How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Figure: neighbourhood N of similar users and similarity weights w.]

RE METHODS: PEARSON CORRELATION

For users a and b, with r_{a,p} the rating of user a for item p and P the set of items rated both by a and b (possible similarity values between −1 and 1):

\[
sim(a,b) \;=\; \frac{\sum_{p \in P} (r_{a,p}-\bar r_a)(r_{b,p}-\bar r_b)}
{\sqrt{\sum_{p \in P} (r_{a,p}-\bar r_a)^2}\;\sqrt{\sum_{p \in P} (r_{b,p}-\bar r_b)^2}}
\]

RE METHODS: K NEAREST NEIGHBORS METHOD

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

RE METHODS: CLUSTER

First cluster the user/item profiles and ratings, then apply knn within one subgroup to produce the predictions.

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining:
• Identify frequent itemsets (rules) in the transaction data:
     IF item A and B THEN item C
     IF item X THEN item Y
• Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

   Support(X,Y) = #trxs containing {X,Y} / #total trxs
   Lift(X,Y)    = Support(X,Y) / ( Support(X) · Support(Y) )

Support & lift example: Diapers–Beer 0.8 vs Diapers–Candles 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
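A quick numeric illustration with hypothetical numbers: if Support(X) = 0.2, Support(Y) = 0.1 and Support(X,Y) = 0.05, then Lift = 0.05 / (0.2 · 0.1) = 2.5, i.e. customers with X are 2.5 times more likely to also have Y than average.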

RE METHOD: ENSEMBLE

A linear combination of the previous methods, to achieve better performance.

PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20 label = svd fconv = 1e-3 gconv = 1e-3
      maxiter = 100 MAXFEVAL = 5000 function = L2
      lamda = 0.2 technique = lbfgs;
   RUN;

   METHOD arm / label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = svd
      Num = 3
      users = ("Longhow Lam");
run;
QUIT;

LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm...)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces. So: can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs given review score:
R² linear regression = 0.5; R² neural net = 0.6.

IENS REVIEWS: APPLY THE MODEL ON 'NEW' REVIEWS

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

[Figure: the first 100 digits of the MNIST data and their KNOWN labels in red.]

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

With a 70/30 training/validation split, the following models were tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours

8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA: APPLY THE MODEL ON THE TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

[Figure: the first 100 predicted digits together with the handwritten digits; red numbers are predicted labels. We see some obvious mistakes...]

SPEECH RECOGNITION: DIGITS RECORDED WITH AN IPHONE

[Audio players: recorded samples of the spoken digits 1 and 2.]

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer. Zero errors on the training data; zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues...

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format). Then apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical cluster)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbour
• Strange faces? proc neural autoencoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

[Figures: SAS faces and actors' faces.]

Read more on my blog.

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 27: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How does it work A simple exampleSuppose we have the following group of people 50 Response 50 No Response

We haveknow Age and Marital Status

5050

Agele 45 Agegt 45

3070

6040

MarriedDivorced UnMarried

2080

6040

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES REGRESSION amp CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 12 X

N 21 B 456 15 X

Y 32 A 545 13 U

Y 34 C 443 11 U

N 23 A 345 17 U

N 13 B 567 12 X

N 45 A 654 19 X

hellip hellip hellip hellip hellip hellip

hellip hellip hellip hellip hellip hellip

Y 46 A 657 21 X

A recursive splitting algorithm

1 Loop trough all inputs2 Determine per input how to split3 Take the best input to split4 On the two new data sets apply 123 againhellip5 Stop somewherehellip

bull How to split X1 or X2 bull When to stop

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA: APPLY MODEL ON THE TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here. Red numbers are predicted labels; we see some obvious mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[embedded audio players: recordings of spoken digits "one" and "two"]

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
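Outside the SAS flow, this pipeline (waveform → spectrum → principal components → classifier) can be sketched in a few lines of Python. The file names, fixed signal length and component count below are illustrative assumptions, not the workshop setup.

import numpy as np
from scipy.io import wavfile
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def spectrum(path, n=30000):
    _, signal = wavfile.read(path)               # assume mono recordings
    signal = np.resize(signal.astype(float), n)  # force a common length
    return np.abs(np.fft.rfft(signal))           # magnitude spectrum

files = [f"one_{i}.wav" for i in range(8)] + [f"two_{i}.wav" for i in range(8)]
X = np.array([spectrum(f) for f in files])       # 16 spectra
y = np.array([1] * 8 + [2] * 8)                  # spoken 'one' vs 'two'

X_small = PCA(n_components=5).fit_transform(X)   # still too much: reduce
clf = LogisticRegression().fit(X_small, y)
print(clf.score(X_small, y))                     # training accuracy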


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on training data. Zero errors on test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• Retrieve the SAS faces from our site
• Put them through the Face++ API
• Collect the JSON results and store them in an ABT

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical cluster)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbour
• Strange faces? proc neural autoencoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

SAS faces vs. actors' faces. Read more on my blog.

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • "Classical" regression
  • linear & Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regression & classification)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging & Boosting
  • Combine models
  • Bagging & Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K – nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)

DECISION TREES: REGRESSION & CLASSIFICATION

Target  X1  X2  X3   X4  X5
Y       12  A   456  12  X
N       21  B   456  15  X
Y       32  A   545  13  U
Y       34  C   443  11  U
N       23  A   345  17  U
N       13  B   567  12  X
N       45  A   654  19  X
…       …   …   …    …   …
Y       46  A   657  21  X

A recursive splitting algorithm:
1. Loop through all inputs
2. Determine per input how to split
3. Take the best input to split
4. On the two new data sets, apply 1, 2, 3 again…
5. Stop somewhere…

• How to split? X1 or X2?
• When to stop?
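A minimal Python sketch of this recursive splitting recipe for a regression target, using squared error as the split criterion (an illustration only, not SAS's tree implementation):

import numpy as np

def best_split(X, y):
    """Loop through all inputs and thresholds; return the best (input, threshold)."""
    best_j, best_t, best_sse = None, None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
            if sse < best_sse:
                best_j, best_t, best_sse = j, t, sse
    return best_j, best_t

def grow(X, y, depth=0, max_depth=3, min_size=10):
    if depth == max_depth or len(y) < min_size:   # "stop somewhere"
        return y.mean()                           # leaf: predict the mean
    j, t = best_split(X, y)
    if j is None:                                 # nothing left to split on
        return y.mean()
    mask = X[:, j] <= t
    return (j, t, grow(X[mask], y[mask], depth + 1, max_depth, min_size),
                  grow(X[~mask], y[~mask], depth + 1, max_depth, min_size))

def predict(node, x):
    while isinstance(node, tuple):                # walk down to a leaf
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node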

DECISION TREES: REGRESSION & CLASSIFICATION

How to split? The number of splits is usually 2 or 3; more splits will exhaust the data too fast.

Why is split X1 < t1 better than X1 < s1? Judge splits by a criterion:
• Regression: mean squared error
• Classification: misclassification rate, cross-entropy, chi-squared

[figure: regression tree, mean squared error for split s1 vs. split t1]

DECISION TREES: REGRESSION & CLASSIFICATION

[figure: classification tree, misclassification rate for split s1 vs. split t1]

DECISION TREES (REGRESSION & CLASSIFICATION)

When to stop? Not too early, not too late.

Pruning: remove parts of the tree.

DECISION TREES: SOME COMMON TYPES

• CHAID (chi-squared automatic interaction detection)
• C4.5, C5.0
• CART (classification and regression trees)

The difference is mainly in the different splitting options.

DECISION TREES: PROS AND CONS

Pros: interaction between variables; interpretable rules; missing values easy to incorporate.

Cons: unstable; "lack of smoothness"; fit of obvious (non)linear relations.

[figure: response-rate tree for Opel Astras, splitting on gender (man/woman), income < 45K and age < 33]

DIMENSION REDUCTION


PRINCIPAL COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data.

The transformation W is such that:
• The largest variance is in the first coordinate
• The second largest variance is in the second coordinate
• Etc.…

PRINCIPAL COMPONENTS ANALYSIS

[scatter plots: data points on the original axes X1, X2 with principal directions P1 and P2, and the same data expressed in P1, P2 coordinates]

PRINCIPAL COMPONENTS ANALYSIS

The math behind: P = XW. With two dimensions:

$$
\begin{bmatrix} p_{11} & p_{21} \\ \vdots & \vdots \\ p_{1n} & p_{2n} \end{bmatrix}
=
\begin{bmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{bmatrix}
\begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{bmatrix}
$$

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.

In general, it turns out that the columns of W are the eigenvectors of the matrix XᵀX.
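The same math in a short numpy sketch (an illustration of the formulas above, not a SAS proc): the columns of W are the eigenvectors of XᵀX, and P = XW has uncorrelated columns with decreasing variance.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.6], [0.6, 1.0]])  # correlated data
X = X - X.mean(axis=0)                   # center first (scaling matters too)

eigvals, W = np.linalg.eigh(X.T @ X)     # columns of W: eigenvectors of X'X
W = W[:, np.argsort(eigvals)[::-1]]      # largest variance first

P = X @ W                                # the principal components P = XW
print(P.var(axis=0))                     # decreasing variances
print(np.corrcoef(P.T)[0, 1])            # ~0: uncorrelated coordinates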

PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here.

Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use PCs instead of the original inputs

PRINCIPAL COMPONENTS: DIMENSION REDUCTION

P = XW. Now only take the first L columns of W: P_L = X·W_L.

For example, for visualization only use the first 2 or 3 columns, so that P_L only has 2 or 3 columns that can be visualized in scatter or contour plots.

P = X·W: (10,000 by 100) = (10,000 by 100)·(100 by 100)
P_L = X·W_L: (10,000 by 2) = (10,000 by 100)·(100 by 2)

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [r could be a large number].

SINGULAR VALUE DECOMPOSITION

Take only k << r singular values: A_k = U_k Σ_k V_kᵀ.

A data point d can now be represented by a k-dimensional point.
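A small numpy illustration of the truncation: keep the k largest singular values and check how well the rank-k matrix approximates A (toy data; k chosen arbitrarily).

import numpy as np

A = np.random.default_rng(1).normal(size=(100, 40))
U, s, Vt = np.linalg.svd(A, full_matrices=False)    # A = U * diag(s) * Vt

k = 10                                              # k << r
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # rank-k approximation
coords = U[:, :k] * s[:k]                           # each row: just k numbers
print(A_k.shape, coords.shape)
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))  # relative reconstruction error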

SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

Original: 2448 × 3264 pixels, ~8 mln numbers.

SVD with the 15 largest SVs: 1% of the data.

SVD with the 75 largest SVs: 5% of the data.

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated. Then use only one input per cluster for predictive modeling.

X1, X2, X3, …, X500

Cluster: X1, X21, X35, X430, … → representative X35
Cluster: X17, X29, X353, X490, … → representative X29
Cluster: X37, X95, X251, X393, … → representative X251
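One way to sketch the idea in Python (an assumption-laden illustration, not SAS's varclus algorithm): hierarchically cluster the inputs on the distance 1 − |correlation| and keep one input per cluster.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(8)
base = rng.normal(size=(300, 3))                    # three hidden "themes"
X = np.repeat(base, 4, axis=1) + 0.1 * rng.normal(size=(300, 12))  # 12 correlated inputs

D = 1 - np.abs(np.corrcoef(X.T))                    # distance: 1 - |correlation|
np.fill_diagonal(D, 0.0)
labels = fcluster(linkage(squareform(D, checks=False), method="average"),
                  t=3, criterion="maxclust")        # three clusters of inputs
print(labels)                                       # then pick one input per cluster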

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

[two screenshot slides with variable clustering output]

BAGGING & BOOSTING

COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough, let multiple models vote for a prediction: Bootstrap Aggregation (Bagging).

[figure: random samples drawn from the data, a model fitted per sample, combined into a final model]

This only makes sense if the underlying models are different enough and have some predictive power.

BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees.

Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Choose randomly m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
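These steps are what, for example, scikit-learn's random forest does under the hood; a sketch on simulated data (toy parameters, not the workshop's Enterprise Miner flow):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

tree = DecisionTreeRegressor().fit(X, y)                    # one deep tree
forest = RandomForestRegressor(n_estimators=100).fit(X, y)  # 100 sub-trees

grid = np.linspace(-3, 3, 7).reshape(-1, 1)
print(tree.predict(grid))     # step-like, unstable predictions
print(forest.predict(grid))   # average over trees: much smoother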

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

Decision tree and random forest (100 sub-trees) fitted on the simulated data.

It is clear to see that the forest can produce much smoother predictions.

GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting: M iterations, m = 1, 2, …, M.

At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals, using inputs x, to "correct" the previous learner:

F_m = F_{m-1} + γ·h_m

The pseudo-residuals r_m are recomputed at each step; after M iterations the final model is F_M.
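The schematic in code: a bare-bones boosting loop for squared-error loss, where every round fits a shallow tree h_m to the residuals and adds γ·h_m (a sketch, not the SAS implementation; γ, M and the tree depth are arbitrary toy choices):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=400)

gamma, M = 0.1, 100
F = np.full_like(y, y.mean())           # F_0: a constant start model
for m in range(M):
    r = y - F                           # pseudo-residuals for squared error
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)
    F = F + gamma * h.predict(X)        # F_m = F_{m-1} + gamma * h_m
print(np.mean((y - F) ** 2))            # training error after M rounds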

SUPPORT VECTOR MACHINES


SUPPORT VECTOR MACHINES (SVM)

Suppose we have a separable classification problem. Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, i.e. x², x³ or spline(x).

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".
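A scikit-learn sketch of the kernel trick: on ring-shaped classes a linear SVM fails, while an RBF kernel separates them without ever forming the non-linear mapping explicitly (toy data; C and the kernel are arbitrary illustrative choices):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 2))
y = (X[:, 0]**2 + X[:, 1]**2 > 1.5).astype(int)   # ring vs. core classes

print(SVC(kernel="linear", C=1.0).fit(X, y).score(X, y))  # poor fit
print(SVC(kernel="rbf", C=1.0).fit(X, y).score(X, y))     # kernel trick: good fit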

SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour

https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.

K – NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

Example: the 5 nearest neighbours of x0. 3 of them are red, 2 of them are green, so we predict x0 to be red.
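In code the method really is this simple; a scikit-learn sketch with k = 5 on toy clouds of points:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # 'red' cloud
               rng.normal(2, 1, (50, 2))])   # 'green' cloud
y = np.array([0] * 50 + [1] * 50)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)   # stores points, fits no model
x0 = np.array([[1.0, 1.0]])                           # the query point
print(knn.predict(x0), knn.predict_proba(x0))         # majority vote of 5 neighbours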

K-NN METHOD

1 nearest neighbour vs. 15 nearest neighbours: use different numbers k of nearest neighbours and compare test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns

K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.

Comparing different nearest neighbours in SAS Enterprise Miner: 30% of the data was used as validation set. In Enterprise Miner different values for k were used; k = 5 nearest neighbours has the lowest average squared error.


NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = w1·1 + w2·X2 + w3·X3 + w4·X4

[diagram: a single compute node combining the inputs 1, X2, X3, X4 with weights w1…w4]

f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w that have to be determined.

NEURAL NETWORKS: MATHEMATICAL FORMULATION

[network diagram: inputs X1…X4 (age, income, region, gender), a hidden layer Z1, Z2, Z3 and output Y, with weights α on the input-to-hidden links and β on the hidden-to-output links]

In formula, the prediction formula for a NN (one hidden layer, as in the diagram) is given by

$$ Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \qquad Y = g(\beta_0 + \beta^T Z) $$

The functions g and σ are the output and hidden activation functions; σ is typically the sigmoid, and in case of a binary classifier g is the logistic function, so the output is a probability.

The model weights α and β have to be estimated from the data.

NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back-propagation algorithm:

Randomly choose small values for all weights w. Then, for each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w in the direction that decreases E: a gradient step w ← w − η·∂E/∂w
4. Stop if the error E is small enough
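The recipe in numpy for the simplest possible "network" (one linear node with squared error); the learning rate and epoch count are arbitrary toy choices:

import numpy as np

rng = np.random.default_rng(6)
X = np.c_[np.ones(100), rng.normal(size=(100, 3))]   # bias input + three inputs
true_w = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = rng.normal(scale=0.01, size=4)      # small random start values
lr = 0.01                               # learning rate eta
for epoch in range(200):
    for xi, yi in zip(X, y):
        pred = xi @ w                   # 1. the network prediction
        grad = 2 * (pred - yi) * xi     # 2./3. dE/dw for E = (actual - pred)^2
        w -= lr * grad                  # adjust the weights downhill
print(w)                                # close to true_w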

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs: X1…X4 → ENCODE → DECODE → X1…X4.

A linear activation function corresponds with 2-dimensional principal components analysis; a 2-dimensional middle layer can be used for visualisation.
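Outside SAS, the same idea can be sketched with a multi-layer perceptron trained to reproduce its own inputs, with a 2-node bottleneck in the middle (an illustration; layer sizes and iteration count are arbitrary assumptions):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPRegressor

X = load_digits().data / 16.0                      # 1797 digit images, 64 pixels
ae = MLPRegressor(hidden_layer_sizes=(32, 2, 32),  # ENCODE - bottleneck - DECODE
                  activation="tanh", max_iter=500)
ae.fit(X, X)                                       # the inputs predict the inputs

def encode(X, ae):
    """The 2-dimensional middle-layer representation, computed layer by layer."""
    a = X
    for W, b in zip(ae.coefs_[:2], ae.intercepts_[:2]):
        a = np.tanh(a @ W + b)
    return a

print(encode(X, ae)[:5])                           # 2 numbers per image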

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes: INPUT → ENCODE → DECODE → OUTPUT = INPUT.

NEURAL NET: CARS EXAMPLE

2-dimensional PCA vs. an autoencoder network 25 – 15 – 2 – 15 – 25.

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

Two-dimensional representation of the 400-dimensional 'digit' data.

BAYESIAN NETWORKS


BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data

TEXT MINING


TEXT MINING BASICS

"Advanced" word counting:

Parse & filter: part of speech, entity detection, mixed / numeric / abbrev., stemming, spell checks, stop list, synonym list, multi-term words.

Then apply traditional data mining: clustering, prediction, machine learning.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam, 1057DK, met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot, wat een slecht stuk ijzer $$"

TERM DOCUMENT MATRIX A

Terms                       Doc 1  Doc 2  Doc 3
+Fiets (znmw)                 1      1      1
Fietsen (ww)                  0      1      0
Blauwe (bvg)                  0      1      0
Amsterdam (locatie)           1      0      0
+Lopen (ww)                   1      1      0
Straat (znmw)                 1      0      0
Kapot (bijw)                  0      0      1
Slecht                        0      0      1
Stuk Ijzer                    0      0      1
1057DK (postcode)             1      0      0
bitlycomsdrtw (Internet)      0      1      0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A

TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• Matrix is often very sparse

Apply singular value decomposition first.

TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with r singular values [could be many thousands]. Take only the first k << r singular values: A_k = U_k Σ_k V_kᵀ.

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
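A scikit-learn sketch of this text pipeline: count terms per document, then map each document to a short SVD vector. TruncatedSVD works directly on the sparse term counts; the Dutch toy documents echo the earlier example, and k = 2 is an arbitrary illustrative choice.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "Ik loop over straat in Amsterdam met mijn fiets",
    "Zij liep niet maar fietste met haar blauwe fiets",
    "Mijn tweewieler is kapot wat een slecht stuk ijzer",
]
A = CountVectorizer().fit_transform(docs)   # document x term counts (sparse)
svd = TruncatedSVD(n_components=2)          # k << number of terms
doc_vectors = svd.fit_transform(A)          # each document: a 2-vector
print(doc_vectors)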

TEXT MINING: APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topic 1, topic 2, topic 3, …).

RECOMMENDATION ENGINE: Which product should I recommend to my customers?

RECOMMENDATION ENGINE: USER–ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items → ~0.01% filled.

User–Item Matrix (data):

          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      3       2       5       4       5
User 2      -       -       -       1       1
User 3      1       -       2       5       -
User 4      -       -       1       2       5
User 5      2       1       4       2       3
User 6      2       3       -       5       1
User 7      5       1       -       3       4
User 8      -       1       -       4       1
User 9      2       3       2       4       2
User 10     -       1       3       -       1

User 4's item ratings: -, -, 1, 2, 5. After some math… the predictions are: 3.21, 4.82, 1, 2, 5.

Recommend item 2.

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD – LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble

RE METHODS: SLOPE ONE

Item–item based: fit y = x + b, a regression with slope equal to 1.

Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the average rating computed from item j.

Sample rating database:

Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        -       2       5
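A hand-rolled sketch of the (unweighted) slope-one idea on the sample table above: average the pairwise rating differences, then shift the user's known ratings to predict a missing one.

ratings = {                                   # the sample rating database
    "John": {"A": 5, "B": 3, "C": 2},
    "Mark": {"A": 3, "B": 4},
    "Lucy": {"B": 2, "C": 5},
}

def avg_diff(i, j):
    """Average of r_i - r_j over users who rated both items."""
    diffs = [r[i] - r[j] for r in ratings.values() if i in r and j in r]
    return sum(diffs) / len(diffs)

# Predict Lucy's rating for item A from her ratings of B and C:
pred = ((ratings["Lucy"]["B"] + avg_diff("A", "B")) +
        (ratings["Lucy"]["C"] + avg_diff("A", "C"))) / 2
print(pred)   # the weighted variant would weight each term by the pair count w_ij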

RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use? How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

RE METHODS: PEARSON CORRELATION

With a, b users, r_{a,p} the rating of user a for item p, and P the set of items rated both by a and b:

$$
sim(a,b)=\frac{\sum_{p\in P}(r_{a,p}-\bar r_a)(r_{b,p}-\bar r_b)}{\sqrt{\sum_{p\in P}(r_{a,p}-\bar r_a)^2}\;\sqrt{\sum_{p\in P}(r_{b,p}-\bar r_b)^2}}
$$

Possible similarity values are between −1 and 1.
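The formula as a small Python function on toy rating dicts (one common convention, used here, computes the means over the items both users rated):

def sim(a, b):
    P = [p for p in a if p in b]                  # items rated by both users
    ra = sum(a[p] for p in P) / len(P)
    rb = sum(b[p] for p in P) / len(P)
    num = sum((a[p] - ra) * (b[p] - rb) for p in P)
    den = (sum((a[p] - ra)**2 for p in P) ** 0.5 *
           sum((b[p] - rb)**2 for p in P) ** 0.5)
    return num / den

print(sim({"A": 5, "B": 3, "C": 2}, {"A": 4, "B": 2, "C": 1}))   # 1.0: same shape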


RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data? Approximate the m × n user–item matrix R by the product of two low-rank matrices: R ≈ U·V, with U (m × k, users) and V (k × n, items).

Predict a new rating as

$$ \hat R_{ij} = U_i^T V_j $$

Minimize the prediction error

$$ \min_{U,V}\;\sum_{i,j}\left(R_{ij}-U_i^T V_j\right)^2+\lambda\left(\lVert U_i\rVert^2+\lVert V_j\rVert^2\right) $$

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS
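A numpy sketch of this optimization with plain gradient steps on the observed cells (PROC RECOMMEND's SVD method uses L-BFGS; the learning rate, λ and k below are arbitrary toy choices):

import numpy as np

R = np.array([[3, 2, 5, 4, 5],
              [0, 0, 0, 1, 1],
              [1, 0, 2, 5, 0]], dtype=float)      # 0 = missing rating
mask = R > 0
k, lam, lr = 2, 0.1, 0.01

rng = np.random.default_rng(7)
U = rng.normal(scale=0.1, size=(R.shape[0], k))   # users x k
V = rng.normal(scale=0.1, size=(k, R.shape[1]))   # k x items
for step in range(2000):
    E = mask * (R - U @ V)             # errors on observed cells only
    U += lr * (E @ V.T - lam * U)      # gradient step in U
    V += lr * (U.T @ E - lam * V)      # gradient step in V
print(np.round(U @ V, 2))              # filled-in rating matrix R-hat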

RE METHODS: CLUSTER

First cluster the users/items on their profiles, then apply k-NN within one subgroup:

user/item profiles → clustering → user/item ratings → predictions

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X,Y) = (# trxs with X and Y) / (total # trxs)
Lift(X,Y) = Support(X,Y) / (Support(X) · Support(Y))

Example support: {Diapers, Beer} 0.8; {Diapers, Candles} 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
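Support and lift computed directly from toy transactions (made-up baskets, just to make the two formulas concrete):

transactions = [
    {"diapers", "beer"}, {"diapers", "beer", "bread"},
    {"diapers", "beer", "milk"}, {"candles"}, {"bread", "milk"},
]
n = len(transactions)

def support(*items):
    return sum(set(items) <= t for t in transactions) / n

sup_xy = support("diapers", "beer")
lift = sup_xy / (support("diapers") * support("beer"))
print(sup_xy, lift)   # 0.6 and ~1.67: diapers and beer co-occur more than chance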

RE METHOD: ENSEMBLE

Take a linear combination of the previous methods to achieve better performance.

/* add a recommendation system */
PROC RECOMMEND recom = rs.IENS;
   ADD rs.IENS
      item = item
      user = user
      rating = rating;

   /* add tables */
   ADDTABLE LHL1209.IENS_UIR
      recom = rs.IENS
      type = rating
      vars = (item user rating);

   /* method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 29: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Regression tree Mean square error

Split s1 Split t1

x

Y Y

x

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES

How to splitNumber is usualy 2 or 3More splits will exhaust the data too fast

Why split X1 ltt1 beter dan X1 lts1 Regression Mean squared Error Classification

Mis-classification rate Cross-entropy Chi-Squared

Classification tree Mis classificatie rate

xSplit s1 Split t1

REGRESSION amp CLASSIFICATION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data? Factor the m × n rating matrix R (users × items) into user factors U (m × k) and item factors V (k × n): R ≈ U V.

Predict a new rating: R̂_ij = U_i^T V_j.

Minimize the regularized prediction error:

$$\min_{U,V}\sum_{i,j}\big(R_{ij}-U_i^{T}V_j\big)^2+\lambda\big(\|U_i\|^2+\|V_j\|^2\big)$$

Select a loss function (squared error), select the number of hidden factors k, and solve the optimization problem (L-BFGS, ALS).
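A minimal sketch of this factorization with stochastic gradient descent in numpy (PROC RECOMMEND uses L-BFGS; the data, k, λ and step size here are made-up illustrations):

import numpy as np

rng = np.random.default_rng(0)
R = np.array([[3, 2, 5], [np.nan, 1, 4], [2, np.nan, 3]], dtype=float)
m, n, k, lam, lr = R.shape[0], R.shape[1], 2, 0.1, 0.05

U = 0.1 * rng.standard_normal((m, k))
V = 0.1 * rng.standard_normal((n, k))
obs = [(i, j) for i in range(m) for j in range(n) if not np.isnan(R[i, j])]

for epoch in range(500):
    for i, j in obs:
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * (err * V[j] - lam * U[i])   # gradient step on user factors
        V[j] += lr * (err * U[i] - lam * V[j])   # gradient step on item factors

print(np.round(U @ V.T, 2))   # filled-in rating matrix, including the missing cells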

RE METHODS: CLUSTER

First cluster the users/items on their profiles or ratings, then apply knn within one subgroup: user/item profile → clustering → user/item ratings → predictions.

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:
IF item A and B THEN item C; IF item X THEN item Y.

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X, Y) = (# trxs with X and Y) / (total # trxs)
Lift = Support(X, Y) / (Support(X) · Support(Y))

Support & lift examples: Diapers → Beer: 0.8; Diapers → Candles: 0.018.

For example, a lift of 2.5 means: people who have X are 2.5 times more likely to buy Y than people who don't have X.
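A tiny illustration of support and lift on made-up transactions:

transactions = [
    {"diapers", "beer"}, {"diapers", "beer", "chips"},
    {"diapers", "candles"}, {"beer", "chips"}, {"diapers", "beer"},
]
n = len(transactions)

def support(*items):
    return sum(set(items) <= t for t in transactions) / n

lift = support("diapers", "beer") / (support("diapers") * support("beer"))
print(support("diapers", "beer"), lift)   # 0.6 and 0.9375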

RE METHOD: ENSEMBLE

Linear combination of the previous methods, to achieve better performance.

PROC RECOMMEND recom = rs.IENS;

/* Add a recommendation system */
ADD rs.IENS item = item user = user rating = rating;

/* Add tables */
ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

/* Method SVD, L-BFGS with 20 factors */
METHOD svd /
factors = 20 label = "svd" fconv = 1e-3 gconv = 1e-3
maxiter = 100 MAXFEVAL = 5000 function = L2
lamda = 0.2 technique = lbfgs;
RUN;

METHOD arm / label = "ARM";
RUN;

/* information on the recommender system */
INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
PREDICT /
method = svd label = "svd"
num = 3 users = ("Longhow Lam");
RUN;
QUIT;

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience; (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm...)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter the reviews, and transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5
R² neural net = 0.6

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS
MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split; models tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels; our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are predicted labels; we see some obvious mistakes...

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[audio players: spoken '1' and '2' samples]

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain; still too much, so apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
SPEECH RECOGNITION

[figure: spectra / principal components of the recordings]

In Enterprise Miner: a neural network with 9 neurons in one hidden layer. Zero errors on the training data; zero errors on the test data (also 8 'ones' and 8 'twos').
STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues...

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical cluster)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

SAS faces vs. actors' faces. Read more on my blog.


DECISION TREES: REGRESSION & CLASSIFICATION

How to split? The number of branches is usually 2 or 3; more splits will exhaust the data too fast.

Why is split X1 < t1 better than X1 < s1?
• Regression: mean squared error
• Classification: misclassification rate, cross-entropy, chi-squared

[figure: classification tree, comparing split s1 and split t1 on misclassification rate]

DECISION TREES (REGRESSION & CLASSIFICATION)

When to stop? Not too early, not too late. Pruning: remove parts of the tree.

DECISION TREES: SOME COMMON TYPES

• CHAID (chi-squared automatic interaction detection)
• C4.5, C5.0
• CART (Classification And Regression Trees)

The difference is mainly in the different splitting options.

DECISION TREES: PROS AND CONS

Pros: interaction between variables; interpretable rules; missing values easy to incorporate.

Cons: unstable; "lack of smoothness"; fit of obvious (non)linear relations.

[example tree (Opel Astra campaign): split on man/woman, then income < 45K and age < 33, with a response rate per leaf]

DIMENSION REDUCTION

PRINCIPAL COMPONENTS ANALYSIS

Linear transformation of the data to uncorrelated data. The transformation W is such that the largest variance is in the first coordinate, the second largest variance is in the second coordinate, etc.

PRINCIPAL COMPONENTS ANALYSIS

[figure: scatter of points on the original axes X1, X2, with the principal directions P1 and P2 drawn through the cloud]

PRINCIPAL COMPONENTS ANALYSIS

[figure: the same data plotted on the new coordinates P1 and P2]

PRINCIPAL COMPONENTS ANALYSIS: THE MATH BEHIND

In general: P = X W.

With two dimensions:

$$\begin{bmatrix}p_{11}&p_{21}\\\vdots&\vdots\\p_{1n}&p_{2n}\end{bmatrix}=\begin{bmatrix}x_{11}&x_{21}\\\vdots&\vdots\\x_{1n}&x_{2n}\end{bmatrix}\begin{bmatrix}w_{11}&w_{21}\\w_{12}&w_{22}\end{bmatrix}$$

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.

It turns out that the columns of W are the eigenvectors of the matrix X^T X.
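A minimal numpy sketch of exactly this construction (centred toy data; the eigenvectors of X^T X form the columns of W):

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2)) @ np.array([[2.0, 0.0], [1.2, 0.4]])
X = X - X.mean(axis=0)                 # centre the data first

eigvals, W = np.linalg.eigh(X.T @ X)   # columns of W = eigenvectors of X'X
order = np.argsort(eigvals)[::-1]      # largest variance first
W = W[:, order]

P = X @ W                              # principal components, uncorrelated
print(np.round(np.cov(P.T), 3))        # off-diagonal entries ~ 0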

PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here.

Applications of PCA:
• Dimension reduction
• Visualisation
• Outlier / anomaly detection
• PCA regression: use the PCs instead of the original inputs

PRINCIPAL COMPONENTS: DIMENSION REDUCTION

P = X W. Now take only the first L columns of W: P_L = X W_L.

For example: P (10,000 × 100) = X (10,000 × 100) · W (100 × 100), while P_L (10,000 × 2) = X (10,000 × 100) · W_L (100 × 2).

For visualization, use only the first 2 or 3 columns, so that P_L has only 2 or 3 columns that can be visualized in scatter or contour plots.

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ V^T, with Σ diagonal containing the r singular values [r could be a large number].

Take only k << r singular values: A_k = U_k Σ_k V_k^T. A datapoint d can now be represented by a k-dimensional point.

SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

Original: 2448 × 3264 ≈ 8 mln numbers.

SVD with the 15 largest SVs: 1% of the data.

SVD with the 75 largest SVs: 5% of the data.
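The same experiment in a few lines of numpy (the random array is a stand-in for a real grayscale photo):

import numpy as np

img = np.random.rand(2448, 3264)   # stand-in for a grayscale photo array

U, s, Vt = np.linalg.svd(img, full_matrices=False)

k = 75                             # keep the 75 largest singular values
approx = U[:, :k] * s[:k] @ Vt[:k, :]

kept = k * (U.shape[0] + Vt.shape[1] + 1) / img.size
print(f"rank-{k} approximation stores {kept:.1%} of the numbers")   # ~5%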

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated; then use only one input per cluster for predictive modeling.

X1, X2, X3, ..., X500 →
cluster {X1, X21, X35, X430, ...} → use X35
cluster {X17, X29, X353, X490, ...} → use X29
cluster {X37, X95, X251, X393, ...} → use X251

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

[screenshots: variable clustering output in SAS]

BAGGING & BOOSTING

COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough, let multiple models vote for a prediction: Bootstrap Aggregation (Bagging). Draw random samples from the data, fit a model on each sample, and combine them into a final model.

This only makes sense if the underlying models are different enough and have some predictive power.

BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees. Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Choose randomly m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.

In case of a regression tree: the random forest prediction is the average of all trees.
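For illustration, the same recipe via scikit-learn rather than Enterprise Miner (synthetic data; max_features plays the role of m << P):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 unpruned trees, each grown on a bootstrap sample with sqrt(P) inputs per split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))   # majority-vote accuracy on the validation part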

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

Decision tree and random forest (100 sub-trees) fitted on the simulated data. It is clear to see that the forest can produce much smoother predictions.

GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting, M iterations m = 1, 2, ..., M: starting from the inputs x and pseudo-residuals r1, at each successive iteration a base learner h_m (a decision tree) is fit on the pseudo-residuals r_im, using the inputs x, to "correct" the previous learner:

F_m = F_(m-1) + γ·h_m

After M steps the final model is F_M.
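A compact sketch of least-squares gradient boosting with tree stumps (scikit-learn trees; γ = 0.1, M = 100 and the sine data are made-up illustrations):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)

gamma, M = 0.1, 100
F = np.full_like(y, y.mean())        # F_0: constant start model
trees = []

for m in range(M):
    r = y - F                        # pseudo-residuals for squared-error loss
    h = DecisionTreeRegressor(max_depth=1).fit(X, r)   # base learner h_m
    F = F + gamma * h.predict(X)     # F_m = F_{m-1} + gamma * h_m
    trees.append(h)

print(np.mean((y - F) ** 2))         # training error after M corrections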

SUPPORT VECTOR MACHINES

Support vector machines (SVM): suppose we have a separable classification problem. Find a linear decision boundary between the two groups with maximum margin M; so the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x). The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".


SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour
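The formulas on this slide were images; the standard textbook formulations of these four problems are:

Separable: $\min_{\beta,\beta_0}\ \tfrac12\|\beta\|^2$ subject to $y_i(x_i^T\beta+\beta_0)\ge 1$.

Non-separable: $\min_{\beta,\beta_0}\ \tfrac12\|\beta\|^2 + C\sum_i\xi_i$ subject to $y_i(x_i^T\beta+\beta_0)\ge 1-\xi_i,\ \xi_i\ge 0$.

Lagrange dual: $\max_{\alpha}\ \sum_i\alpha_i-\tfrac12\sum_{i,j}\alpha_i\alpha_j y_i y_j\,x_i^Tx_j$ subject to $0\le\alpha_i\le C,\ \sum_i\alpha_i y_i=0$.

Kernel trick: replace $x_i^Tx_j$ by $K(x_i,x_j)$, e.g. $K(x,x')=\exp(-\gamma\|x-x'\|^2)$.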

https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.

K – NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, ..., xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

[figure: the 5 nearest neighbours of x0; 3 of them are red, 2 of them are green, so we predict x0 to be red]
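A self-contained sketch of exactly this vote (plain numpy, Euclidean distance, toy points):

import numpy as np
from collections import Counter

def knn_predict(X, y, x0, k=5):
    """Majority vote among the k training points closest to x0."""
    dist = np.linalg.norm(X - x0, axis=1)      # distances to the query point
    nearest = np.argsort(dist)[:k]             # indices of the k neighbours
    return Counter(y[nearest]).most_common(1)[0][0]

X = np.array([[0, 0], [1, 0], [0, 1], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array(["red", "red", "red", "green", "green", "green"])
print(knn_predict(X, y, np.array([1.0, 1.0]), k=5))   # 3 red vs 2 green -> 'red'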

K-NN METHOD

[figure: decision boundaries of the 1 nearest neighbour and the 15 nearest neighbour classifiers]

K-NN METHOD

Use different numbers k of nearest neighbours; compare test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns

K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price? For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.

Comparing different nearest neighbours in SAS Enterprise Miner: 30% of the data was used as validation set. In Enterprise Miner different values for k were used; k=5 nearest neighbours has the lowest average squared error.

NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = f(w1 + w2·X2 + w3·X3 + w4·X4)

[diagram: inputs 1, X2, X3, X4 with weights w1, ..., w4 feeding one compute node]

f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w's that have to be determined.

NEURAL NETWORKS: MATHEMATICAL FORMULATION

[diagram: inputs X1 (age), X2 (income), X3 (region), X4 (gender) → hidden layer Z1, Z2, Z3 → output Y, with weights α and β]

In formula, the prediction formula for a NN is given by:

Z_m = σ(α_{0m} + α_m^T X)
Y = g(β_0 + β^T Z)

The functions g and σ are activation functions; in case of a binary classifier, g is typically the logistic function. The model weights α and β have to be estimated from the data.

NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back-propagation algorithm: randomly choose small values for all w_i's. For each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual – prediction)²)
3. Adjust the weights w according to the gradient step w_new = w_old − η·∂E/∂w
4. Stop if the error E is small enough
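A minimal numpy sketch of these four steps for one hidden layer (logistic activations and squared error as on the slide; the XOR data, layer sizes and learning rate are made-up illustrations):

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)          # XOR target

def sig(t):
    return 1.0 / (1.0 + np.exp(-t))

def add_bias(a):
    return np.hstack([np.ones((a.shape[0], 1)), a])      # constant input "1"

W1 = rng.normal(0, 0.5, (3, 4))   # 2 inputs + bias -> 4 hidden
W2 = rng.normal(0, 0.5, (5, 1))   # 4 hidden + bias -> 1 output
eta = 1.0

for step in range(10000):
    Z = sig(add_bias(X) @ W1)            # 1. prediction (forward pass)
    Yhat = sig(add_bias(Z) @ W2)
    E = Yhat - y                         # 2. error
    d2 = E * Yhat * (1 - Yhat)           # 3. back-propagate the error ...
    d1 = (d2 @ W2[1:].T) * Z * (1 - Z)
    W2 -= eta * add_bias(Z).T @ d2       #    ... and adjust the weights
    W1 -= eta * add_bias(X).T @ d1

print(np.round(Yhat.ravel(), 2))         # close to [0, 1, 1, 0]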

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS: AUTOENCODERS

Neural networks that use the inputs to predict the inputs: X1, X2, X3, X4 → ENCODE → DECODE → X1, X2, X3, X4.

A linear activation function corresponds with 2-dimensional principal components analysis. A 2-dimensional middle layer can be used for visualisation.

https://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

NEURAL NETS: AUTOENCODERS

Often more hidden layers with many nodes: INPUT → ENCODE → DECODE → OUTPUT = INPUT.

NEURAL NET: CARS EXAMPLE

2-dimensional PCA vs. an autoencoder network 25 – 15 – 2 – 15 – 25.

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, ..., x400)
• Compare two-dimensional PCA with a neural net autoencoder

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
data = autoencoderTraining dmdbcat = work.autoencoderTrainingCat;
performance compile details cpucount = 12 threads = yes;
/* DEFAULTS: ACT = TANH, COMBINE = LINEAR */
/* IDS ARE USED AS LAYER INDICATORS */
/* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
archi MLP hidden = 5;
hidden 300 / id = h1;
hidden 100 / id = h2;
hidden 2 / id = h3 act = linear;
hidden 100 / id = h4;
hidden 300 / id = h5;
input corruptedPixel1 - corruptedPixel400 / id = i level = int std = std;
target pixel1 - pixel400 / act = identity id = t level = int std = std;
/* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
initial random = 123;
prelim 10 preiter = 10;
run;

Two-dimensional representation of the 400-dimensional 'digit' data.

BAYESIAN NETWORKS

BAYESIAN NETWORKS – ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from the training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data

TEXT MINING

TEXT MINING BASICS

"Advanced" word counting:

Parse & Filter: part of speech, entity detection, mixed/numeric/abbrev., stemming, spell checks, stop list, synonym list, multi-term words.

Apply traditional data mining: clustering, prediction, machine learning.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" ("I walk down the street in Amsterdam 1057DK with my bike")
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw" ("She did not walk but cycled with her blue bike bitlycomsdrtw")
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" ("My two-wheeler is broken, what a bad piece of iron $$")

TERM DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A

Terms                      Doc 1  Doc 2  Doc 3
+Fiets (noun)                1      1      1
Fietsen (verb)               0      1      0
Blauwe (adjective)           0      1      0
Amsterdam (location)         1      0      0
+Lopen (verb)                1      1      0
Straat (noun)                1      0      0
Kapot (adverb)               0      0      1
Slecht                       0      0      1
Stuk Ijzer                   0      0      1
1057DK (postal code)         1      0      0
bitlycomsdrtw (Internet)     0      1      0
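Building such a term-document matrix outside SAS Text Miner takes only a few lines, e.g. with scikit-learn (plain counts only; no stemming, stop lists or synonym lists here):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Ik loop over straat in Amsterdam 1057DK met mijn fiets",
    "Zij liep niet maar fietste met haar blauwe fiets",
    "Mijn tweewieler is kapot wat een slecht stuk ijzer",
]

vec = CountVectorizer()
A = vec.fit_transform(docs)   # sparse documents-by-terms count matrix

print(vec.get_feature_names_out())
print(A.toarray().T)          # transposed: one row per term, as on the slide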

TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• Matrix is often very sparse

Apply Singular value decomposition first.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 31: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees (regressie amp classificatie)

When to stop Not too early not too late

PruningRemove parts the tree

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• Matrix is often very sparse

Apply singular value decomposition first.

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ V^T, with Σ diagonal with r singular values [could be many thousands].

Take only the first k << r singular values: A_k = U_k Σ_k V_k^T.

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
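To make this concrete, here is a minimal SAS/IML sketch of the truncation on a made-up 4-terms-by-3-documents count matrix (the matrix and k = 2 are illustrative assumptions, not values from the slides):

proc iml;
   /* toy term document matrix: 4 terms by 3 documents (made-up counts) */
   A = {1 1 1,
        0 1 0,
        1 0 0,
        0 0 1};
   call svd(U, Q, V, A);                        /* A = U * diag(Q) * V` */
   k = 2;                                       /* keep only the k largest singular values */
   Ak = U[, 1:k] * diag(Q[1:k]) * t(V[, 1:k]);  /* rank-k approximation of A */
   /* each document is now a k-dimensional column instead of a long count vector */
   docs = diag(Q[1:k]) * t(V[, 1:k]);
   print Q[format=6.3], Ak[format=6.3], docs[format=6.3];
quit;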


TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

Topic 1 Topic 2 Topic 3

RECOMMENDATION ENGINE Which product should I recommend to my customers?

RECOMMENDATION ENGINE USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives ~ 0.01% filled.

User - Item Matrix – Data
           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings:                              -     -     1     2     5
After some math… the estimated ratings for User 4:  3.21  4.82  1     2     5

Recommend item 2!

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
Model-based algorithms: matrix factorization (SVD - LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble

RE METHODS SLOPE ONE

Item-item based, with predictors of the form y = x + b, i.e. regressions with slope equal to 1.

Weight w_ij: the number of users having rated both items i and j
Rating r_uj: the average rating computed from item j

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         2        -        5
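As a worked illustration on this table (a hand computation of weighted slope one, not the PROC RECOMMEND internals): to fill in Lucy's missing rating for item B, item A gives an average difference B − A of ((3 − 5) + (4 − 3)) / 2 = −0.5 over the two users who rated both, so a prediction of 2 − 0.5 = 1.5 with weight 2; item C gives a difference B − C of 3 − 2 = 1 over the one user who rated both, so a prediction of 5 + 1 = 6 with weight 1. The weighted slope-one prediction is (2 · 1.5 + 1 · 6) / (2 + 1) = 3.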


RE METHODS K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood" N, weighted by similarities w.

• How to determine the neighbors, and how many (k) to use?
• How to compute the similarity / distance measure?
  - Pearson's correlation coefficient
  - Cosine distance
  - Other adjustments

RE METHODS PEARSON CORRELATION

a, b : users
r_{a,p} : rating of user a for item p
P : set of items rated both by a and b
• Possible similarity values between −1 and 1

sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )
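A minimal SAS/IML sketch of this similarity on toy data (the 5 × 5 rating matrix and the module are illustrative assumptions; PROC RECOMMEND computes this internally, and the k most similar users would form the neighborhood N):

proc iml;
   /* toy user-item ratings, . = missing */
   R = {5 3 . 1 4,
        4 . 3 1 5,
        1 1 . 5 2,
        2 3 4 . 1,
        5 4 . 2 4};
   start pearson(a, b);
      p = loc(a ^= . & b ^= .);         /* items rated by both users */
      if ncol(p) < 2 then return(0);
      xa = a[, p]; xb = b[, p];
      x = xa - xa[:]; y = xb - xb[:];   /* center on the co-rated means */
      d = sqrt(ssq(x)) * sqrt(ssq(y));
      if d = 0 then return(0);
      return( sum(x # y) / d );
   finish;
   /* similarities of user 1 with all other users */
   sim = j(1, 5, 0);
   do v = 2 to 5;
      sim[v] = pearson(R[1, ], R[v, ]);
   end;
   print sim[format=6.3];
quit;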


RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

Factorize the user-item matrix: R (m × n, users × items) ≈ U V, with U (m × k) and V (k × n).

Select a loss function (squared error)
Select the number of hidden factors k
Optimization problem: L-BFGS, ALS

Predict a new rating: R̂_ij = U_i^T V_j

Minimize the prediction error: min_{U,V} Σ_{i,j} ( R_ij − U_i^T V_j )² + λ( ‖U_i‖² + ‖V_j‖² )
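A hedged SAS/IML sketch of this factorization, using plain stochastic gradient descent on a made-up 5 × 4 rating matrix with 0 marking missing entries (PROC RECOMMEND itself optimizes with L-BFGS or ALS; the step size, λ and iteration count below are arbitrary choices):

proc iml;
   call randseed(123);
   /* toy 5 x 4 rating matrix, 0 = missing (assumption for this sketch) */
   R = {5 3 0 1,
        4 0 0 1,
        1 1 0 5,
        1 0 0 4,
        0 1 5 4};
   m = nrow(R); n = ncol(R); k = 2;             /* k hidden factors */
   U = j(m, k, 0); call randgen(U, "Uniform");  /* random start values */
   V = j(k, n, 0); call randgen(V, "Uniform");
   eta = 0.01; lambda = 0.2;                    /* step size, regularization */
   do iter = 1 to 2000;
      do i = 1 to m;
         do jj = 1 to n;
            if R[i, jj] > 0 then do;            /* only observed ratings */
               e = R[i, jj] - U[i, ] * V[, jj]; /* prediction error */
               U[i, ] = U[i, ] + eta * (e * t(V[, jj]) - lambda * U[i, ]);
               V[, jj] = V[, jj] + eta * (e * t(U[i, ]) - lambda * V[, jj]);
            end;
         end;
      end;
   end;
   Rhat = U * V;                                /* filled-in rating matrix */
   print Rhat[format=6.2];
quit;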


RE METHODS CLUSTER

First cluster the user/item profiles, then apply knn within one subgroup:
user/item profile → clustering → user/item rating → predictions

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X → Y) = (# trxs with X and Y) / (total # trxs)
Lift(X → Y) = Support(X, Y) / ( Support(X) · Support(Y) )

Support & Lift examples: Diapers → Beer 0.8, Diapers → Candles 0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
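A small SAS/IML sketch of these two measures on a toy transaction set (the 8 × 3 indicator matrix, with columns diapers, beer, candles, is made up for illustration):

proc iml;
   /* rows = transactions, columns = diapers, beer, candles (toy data) */
   T = {1 1 0,
        1 1 0,
        1 0 1,
        0 1 0,
        1 1 0,
        0 0 1,
        1 0 0,
        1 1 0};
   n = nrow(T);
   suppX  = T[+, 1] / n;                 /* support(diapers) */
   suppY  = T[+, 2] / n;                 /* support(beer) */
   suppXY = sum(T[, 1] # T[, 2]) / n;    /* support(diapers and beer) */
   lift   = suppXY / (suppX * suppY);
   print suppX suppY suppXY lift[format=6.3];
quit;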


METHOD ENSEMBLE

A linear combination of the previous methods, to achieve better performance.

PROC RECOMMEND recom = rsIENS;
   /* add a recommendation system */
   ADD rsIENS item = item user = user rating = rating;
   /* add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rsIENS type = rating vars = (item user rating);
   /* method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3 gconv = 1e-3
      maxiter = 100 MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;
   METHOD arm /
      label = "ARM";
   RUN;
   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rsIENS;
   PREDICT /
      method = svd
      label = "svd"
      num = 3
      users = ("Longhow Lam");
run;
QUIT;

LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces…

So: can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs given review score:
R² linear regression = 0.5
R² neural net = 0.6

IENS REVIEWS APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split; techniques tried:
• PCA regression on the 50 largest PC's
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest neighbors has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels; our best model predicted the label for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here. Red numbers are the predicted labels; we obviously see some mistakes…

SPEECH RECOGNITION DIGITS RECORDED WITH IPHONE

[audio players: recorded samples of spoken '1' and '2']

SPEECH RECOGNITION

WAV files consist of ~ 30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
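A minimal SAS/IML sketch of the spectral-analysis step on a made-up test tone (the 8 kHz sample rate and 440 Hz frequency are illustrative assumptions; the real inputs are the recorded wav files, whose spectra would then be stacked and reduced with principal components before the neural net):

proc iml;
   /* toy stand-in for one wav signal: 1 second sampled at 8 kHz */
   t = do(0, 1, 1/8000);
   x = sin(2 # constant("pi") # 440 # t);   /* a pure 440 Hz tone */
   f = fft(t(x));                           /* frequency domain: real and imaginary parts */
   ampl = sqrt(f[, 1]##2 + f[, 2]##2);      /* amplitude spectrum */
   print (max(ampl)) (ampl[<:>]);           /* peak amplitude and its frequency bin */
quit;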


SPEECH RECOGNITION


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, and collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical cluster)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder

STRANGE FACE DETECTION BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog


STRANGE FACE DETECTION COMBO OF OPEN API, R & SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 32: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C45 C50

CART (Classification and Regression)

The difference is mainly in the different splitting options

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 33: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Decision trees pros and conspros Interaction between variables Interpretable rules Missing values easy to incorporate

cons Unstable ldquoLack-of-Smoothnesrdquo Fit of obvious (non)linear relations

man vrouw

Inkomen lt 45 K Leeftijd lt 33

Response rate

Opel Astras

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data = autoencoderTraining
   dmdbcat = work.autoencoderTrainingCat;
   performance compile details cpucount = 12 threads = yes;
   /* DEFAULTS: ACT = TANH, COMBINE = LINEAR          */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED       */
   archi MLP hidden = 5;
   hidden 300 / id = h1;
   hidden 100 / id = h2;
   hidden 2   / id = h3 act = linear;
   hidden 100 / id = h4;
   hidden 300 / id = h5;
   input corruptedPixel1 - corruptedPixel400 / id = i level = int std = std;
   target pixel1 - pixel400 / act = identity id = t level = int std = std;
   /* BEFORE PRELIMINARY TRAINING, WEIGHTS WILL BE RANDOM */
   initial random = 123;
   prelim 10 preiter = 10;
run;


Two-dimensional representation of the 400-dimensional 'digit' data


BAYESIAN NETWORKS


BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
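
The joint distribution encoded by such a graph factorizes over the nodes; written out (standard Bayesian-network theory, not text from the slide):

P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i))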



TEXT MINING


TEXT MINING BASICS

"Advanced" word counting

Parse & Filter: part-of-speech tagging, entity detection, mixed/numeric/abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Apply traditional data mining: clustering, prediction, machine learning.


TEXT MINING BASICS

Three Dutch example documents:
Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

Terms                      Doc 1  Doc 2  Doc 3
+Fiets (znmw)                1      1      1
Fietsen (ww)                 0      1      0
Blauwe (bvg)                 0      1      0
Amsterdam (locatie)          1      0      0
+Lopen (ww)                  1      1      0
Straat (znmw)                1      0      0
Kapot (bijw)                 0      0      1
Slecht                       0      0      1
Stuk Ijzer                   0      0      1
1057DK (postcode)            1      0      0
bitlycomsdrtw (Internet)     0      1      0

TERM DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A


TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:

• Often more terms than documents

• Rows could be strongly correlated

• Matrix is often very sparse

Apply Singular value decomposition first


TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.

Matrix SVD decomposition:

A = U Σ V^T, with Σ diagonal, containing the r singular values [r could be many thousands].

Take only the first k << r singular values:

A ≈ A_k = U_k Σ_k V_k^T
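
A new document vector d (raw word counts) can then be projected into the k-dimensional concept space; this folding-in step is standard latent-semantic-indexing practice rather than text from the slide:

\hat{d} = \Sigma_k^{-1} U_k^T d \in \mathbb{R}^k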


TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud).

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[Diagram: documents grouped into Topic 1, Topic 2, Topic 3]


RECOMMENDATION ENGINE: Which product should I recommend my customers?


RECOMMENDATION ENGINE: USER - ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: with 1 mln users and 100K items, typically only ~0.01% is filled.

User - Item Matrix - Data

           Item 1  Item 2  Item 3  Item 4  Item 5
User 1        3       2       5       4       5
User 2        -       -       -       1       1
User 3        1       -       2       5       -
User 4        -       -       1       2       5
User 5        2       1       4       2       3
User 6        2       3       -       5       1
User 7        5       1       -       3       4
User 8        -       1       -       4       1
User 9        2       3       2       4       2
User 10       -       1       3       -       1

User 4's item ratings:                         -  -  1  2  5
After some math… the predicted ratings for User 4 are:  3.21  4.82  1  2  5

→ Recommend item 2


RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms:
• Slope one (slope1)
• K nearest neighbors (knn)

Model-based algorithms:
• Matrix factorization (SVD - LBFGS)

Market basket analysis:
• Association rules mining (arm)

Mixture of different methods:
• Clustering (cluster)
• Ensemble


RE METHODS SLOPE ONE

Predictors of the form y = x + b, i.e. with slope equal to 1.


Item-item based

Weight w_ij: the number of users having rated both items i and j. Rating r̄_j: the average rating computed for item j.

Sample rating database:

Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4        ?
Lucy          ?        2        5
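
A worked prediction, under the assumption that Mark's missing rating is Item C and Lucy's is Item A (the flattened slide table does not say which cells were empty): the average deviations are dev(C,A) = 2 - 5 = -3 (one co-rater) and dev(C,B) = ((2-3) + (5-2)) / 2 = 1 (two co-raters), so the weighted slope-one prediction for Mark's Item C rating is

\hat{r}_{Mark,C} = \frac{1 \cdot (-3 + 3) + 2 \cdot (1 + 4)}{1 + 2} = \frac{10}{3} \approx 3.3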


RE METHODS K NEAREST NEIGHBORS

The rating r_{u,i} is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Diagram: neighbourhood N around a user and similarity weights w]


RE METHODS: PEARSON CORRELATION

a, b : users
r_{a,p} : rating of user a for item p
P : set of items rated both by a and b
• Possible similarity values between -1 and 1

\mathrm{sim}(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2} \, \sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}


RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

Approximate the m × n rating matrix R (users in the rows, items in the columns) by a product of two low-rank matrices:

R (m × n) ≈ U (m × k) · V (k × n)

Select a loss function (squared error); select the number of hidden factors k; optimization problem: L-BFGS, ALS.

Predict a new rating: \hat{R}_{ij} = U_i^T V_j

Minimize the prediction error:

\min_{U,V} \sum_{i,j} \left( R_{ij} - U_i^T V_j \right)^2 + \lambda \left( \lVert U_i \rVert^2 + \lVert V_j \rVert^2 \right)


RE METHODS CLUSTER

k-NN within one subgroup.

[Diagram: user/item profiles and user/item ratings are first clustered; predictions are then made within each cluster]


RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X,Y) = (# trxs containing X and Y) / (total # trxs)

Lift(X ⇒ Y) = Support(X,Y) / ( Support(X) · Support(Y) )

Support & lift examples: Diapers ⇒ Beer: 0.8; Diapers ⇒ Candles: 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
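
As a worked example with assumed numbers consistent with that statement: if Support(X) = 0.2, Support(Y) = 0.16 and Support(X,Y) = 0.08, then

Lift(X ⇒ Y) = 0.08 / (0.2 × 0.16) = 2.5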


METHOD ENSEMBLE

A linear combination of the previous methods, to achieve better performance.


PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method: SVD via L-BFGS with 20 factors */
   METHOD svd /
      factors   = 20
      label     = svd
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      maxfeval  = 5000
      function  = L2
      lamda     = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label  = svd
      num    = 3
      users  = ("Longhow Lam");
run;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easy deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform them to data points in SVD space.


Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R² Linear regression = 0.5; R² Neural Net = 0.6
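
A minimal sketch of the linear-regression benchmark, assuming the 300 SVD dimensions are stored as variables svd1-svd300 next to the review score (table and variable names are illustrative, not from the deck):

/* OLS benchmark on the 300 SVD inputs */
proc reg data = work.iens_svd;
   model score = svd1 - svd300;
run;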


IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)


MNIST TRAINING DATA

42000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split; models tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees (see the sketch below)
• 8, 16 and 24 nearest neighbors
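
For the random forest candidates, a minimal PROC HPFOREST sketch; the table name work.mnist_train and the pixel/label variable names are assumptions for illustration:

/* random forest on the raw pixels */
proc hpforest data = work.mnist_train
              maxtrees = 500       /* size of the ensemble      */
              vars_to_try = 28;    /* m << 784 inputs per split */
   target label / level = nominal; /* the known digit 0-9       */
   input pixel1 - pixel784 / level = interval;
run;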


MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels.

Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here.

Red numbers are predicted labels; we see some obvious mistakes…


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE



SPEECH RECOGNITION

WAV files consist of ~30000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain.

Still too much: apply principal components.

TRAIN DATA

8 spoken 'ones' in wav files; 8 spoken 'twos' in wav files
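
A sketch of that two-step reduction, assuming each recording has been read into a SAS table with one amplitude column (all names here are illustrative):

/* step 1: periodogram of one recorded signal */
proc spectra data = work.wav1 out = work.spec1 p;
   var amplitude;
run;

/* step 2: after stacking the spectra of all recordings (one row per
   recording), keep a handful of principal components as inputs */
proc princomp data = work.allspectra out = work.pcscores n = 10;
   var freq1 - freq100;
run;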


SPEECH RECOGNITION


SPEECH RECOGNITION

Zero errors on the training data.

Zero errors on the test data (also 8 'ones' and 8 'twos').

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to: retrieve the SAS faces from our site, put them through the Face++ API, and collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? → proc cluster (hierarchical clustering)
• Sales faces? → predictive modeling / machine learning
• Who is the Brad Pitt look-alike? → nearest neighbor
• Strange faces? → proc neural auto-encoder


STRANGE FACE DETECTION: LOOK-ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION: STRANGE FACES

SAS Faces Actors Faces

Read more on my blog


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 34: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DIMENSION REDUCTION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 35: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that The largest variance is in the first coordinate The second largets variance is in the second coordinate Etchellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

X1

X2

P 1

P 2

x x x x x x x

xx

x

x

xx

x

x

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back propagation algorithm:

Randomly choose small values for all wi's. For each data point (observation):

1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual - prediction)^2)
3. Adjust the weights w according to the update rule below
4. Stop if the error E is small enough
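The update formula itself is missing from the transcript; step 3 is the usual gradient-descent step, with learning rate \eta (an assumed but standard choice):

w \leftarrow w - \eta \, \frac{\partial E}{\partial w}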

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: X1-X4 → ENCODE → 2-node middle layer → DECODE → X1-X4]

A linear activation function corresponds with 2-dimensional principal components analysis. A 2-dimensional middle layer can be used for visualisation.

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes.

[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]

NEURAL NET: CARS EXAMPLE

2-dimensional PCA vs. an autoencoder network 25 - 15 - 2 - 15 - 25.

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, ..., x400)
• Compare two-dimensional PCA with a neural net autoencoder

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR              */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6   */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED         */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

Two-dimensional representation of the 400-dimensional 'digit' data.

BAYESIAN NETWORKS


BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from the training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data

TEXT MINING


TEXT MINING BASICS

"Advanced" word counting:

Parse & filter: part of speech, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Apply traditional data mining: clustering, prediction, machine learning.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" ("I walk along the street in Amsterdam 1057DK with my bike")
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitly.com/sdrtw" ("She did not walk but cycled with her blue bike [sic]")
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" ("My two-wheeler is broken, what a bad piece of iron $$")

Terms (Dutch part-of-speech tags)   Doc 1   Doc 2   Doc 3
+Fiets (znmw)                         1       1       1
Fietsen (ww)                          0       1       0
Blauwe (bvg)                          0       1       0
Amsterdam (locatie)                   1       0       0
+Lopen (ww)                           1       1       0
Straat (znmw)                         1       0       0
Kapot (bijw)                          0       0       1
Slecht                                0       0       1
Stuk Ijzer                            0       0       1
1057DK (postcode)                     1       0       0
bitly.com/sdrtw (Internet)            0       1       0

TERM DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A

TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:

• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply a singular value decomposition first.

TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ V^T, with Σ diagonal with r singular values [r could be many thousands].

Take only the first k << r singular values: A ≈ Ak = Uk Σk Vk^T.

A document d is then no longer a long vector of m word counts but a much shorter vector, say of length 300.

TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.

RECOMMENDATION ENGINE: Which product should I recommend my customers?

RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items, ~0.01% filled.

User-Item Matrix - Data
           Item 1   Item 2   Item 3   Item 4   Item 5
User 1        3        2        5        4        5
User 2        -        -        -        1        1
User 3        1        -        2        5        -
User 4        -        -        1        2        5
User 5        2        1        4        2        3
User 6        2        3        -        5        1
User 7        5        1        -        3        4
User 8        -        1        -        4        1
User 9        2        3        2        4        2
User 10       -        1        3        -        1

User 4's item ratings:              -      -      1      2      5
After some math... predictions:   3.21   4.82     1      2      5

Recommend item 2!

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
Model-based algorithms: matrix factorization (SVD - LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble

RE METHODS: SLOPE ONE

Item-item based: y = x + b, with the slope equal to 1.

Weight wij: the number of users having rated both items i and j. Rating ruj: the average rating computed from item j.

Sample rating database:

Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4        -
Lucy          -        2        5
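As a sketch (the slide's notes are not in the transcript), the weighted slope-one prediction, with wij and ruj as defined above, is

\mathrm{dev}_{i,j} = \frac{1}{w_{ij}} \sum_{u \in S_{i,j}} (r_{u,i} - r_{u,j}), \qquad \hat r_{u,i} = \frac{\sum_{j \ne i} (\mathrm{dev}_{i,j} + r_{u,j})\, w_{ij}}{\sum_{j \ne i} w_{ij}}

For the table above: dev(A,B) = ((5-3) + (3-4))/2 = 0.5 and dev(A,C) = (5-2)/1 = 3, so Lucy's predicted rating of item A is ((0.5+2)·2 + (3+5)·1) / (2+1) ≈ 4.33.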

RE METHODS: K NEAREST NEIGHBORS

The rating rui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Diagram: similarity w, neighbors N]

RE METHODS: PEARSON CORRELATION

a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between -1 and 1.

\mathrm{sim}(a,b)=\frac{\sum_{p\in P}(r_{a,p}-\bar r_a)(r_{b,p}-\bar r_b)}{\sqrt{\sum_{p\in P}(r_{a,p}-\bar r_a)^2}\;\sqrt{\sum_{p\in P}(r_{b,p}-\bar r_b)^2}}

RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

Factorize the m × n user-item matrix: R ≈ U·V, with U (m × k, users) and V (k × n, items).

Select a loss function (squared error), select the number of hidden factors k; optimization problem solved with L-BFGS or ALS.

Predict a new rating: \hat{R}_{ij} = U_i^T V_j

Minimize the prediction error:
\min_{U,V}\;\sum_{i,j}\bigl(R_{ij}-U_i^T V_j\bigr)^2+\lambda\bigl(\lVert U_i\rVert^2+\lVert V_j\rVert^2\bigr)

RE METHODS: CLUSTER

First cluster the users/items on their profiles, then apply kNN within one subgroup.

[Diagram: user/item profiles and ratings → clustering → predictions]

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X → Y) = (# trxs containing X and Y) / (total # trxs)
Lift(X → Y) = Support(X,Y) / (Support(X) · Support(Y))

Support & lift example:
Diapers → Beer: 0.8
Diapers → Candles: 0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance


PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      Num = 3
      users = ("Longhow Lam");
   run;
QUIT;

LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
- Unfamiliar to a broader audience, (more) difficult to explain
- Black box approach (you are rejected: the computer says NO)
- Often relations can already be modeled with classical regression models
- It allows you to not think about the business problem

PROS:
- Often less data prep (manual tuning) necessary (just throw it in the algorithm...)
- Interactions often "automatically" taken into account
- Superior for text mining, image & speech recognition
- Better lift possible (a few percent "for free")
- It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:

R² linear regression = 0.5
R² neural net = 0.6

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:

R² linear regression = 0.5
R² neural net = 0.6

IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split. Models tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100 and 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA: APPLY THE MODEL ON THE TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here. Red numbers are the predicted labels. We see some obvious mistakes...

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Embedded audio players: spoken digits 1 and 2]

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.

SPEECH RECOGNITION


SPEECH RECOGNITION

Zero errors on the training data.

Zero errors on the test data (also 8 'ones' and 8 'twos').

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues...

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to: retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
- Which faces are look-alikes? → proc cluster (hierarchical clustering)
- Sales faces? → predictive modeling / machine learning
- Who is the Brad Pitt look-alike? → nearest neighbour
- Strange faces? → proc neural autoencoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

SAS faces vs. actors' faces

Read more on my blog.

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS faces vs. actors' faces

Read more on my blog.

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • "Classical" regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regression & classification)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principal Components
  • Principal Components (2)
  • Principal Components (3)
  • Principal Components (4)
  • Principal Components (5)
  • Principal components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging & Boosting
  • Combine models
  • Bagging & Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K - nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 36: Machine learning overview (with SAS software)


PRINCIPAL COMPONENTS ANALYSIS

[Scatter plot: data points in the (X1, X2) plane with the first and second principal component directions P1 and P2 drawn through the point cloud]

PRINCIPAL COMPONENTS ANALYSIS

[The same data projected onto the principal components P1 and P2]

PRINCIPAL COMPONENTS ANALYSIS: THE MATH BEHIND

With two dimensions, P = XW reads

\begin{bmatrix} p_{11} & p_{21} \\ \vdots & \vdots \\ p_{1n} & p_{2n} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{bmatrix} \begin{bmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{bmatrix}

w11 and w12 are the loadings corresponding to the first principal component; w21 and w22 are the loadings corresponding to the second principal component.

In general, it turns out that the columns of W are the eigenvectors of the matrix X^T X.

PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here.

Applications of PCA:
- Dimension reduction
- Visualisation
- Outlier / anomaly detection
- PCA regression: use the PCs instead of the original inputs
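A minimal sketch in SAS (hypothetical dataset and variable names): PROC PRINCOMP works on the correlation matrix by default, i.e. on scaled inputs, and writes the component scores to a dataset for visualisation:

proc princomp data=mydata out=scores n=2;  /* keep the first 2 components */
   var x1-x100;
run;

proc sgplot data=scores;                   /* plot the scores             */
   scatter x=Prin1 y=Prin2;
run;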

PRINCIPAL COMPONENTS: DIMENSION REDUCTION

P = XW. Now only take the first L columns of W:

P_L = X W_L

For example, for visualization only use the first 2 or 3 columns, so that P_L only has 2 or 3 columns that can be visualized in scatter or contour plots.

P = XW:      (10000 × 100) = (10000 × 100) (100 × 100)
P_L = X W_L: (10000 × 2)   = (10000 × 100) (100 × 2)

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: A = U Σ V^T, where Σ is diagonal with r singular values [r could be a large number].

SINGULAR VALUE DECOMPOSITION

Take only k << r singular values: A ≈ A_k = U_k Σ_k V_k^T.

A datapoint d can now be represented by a k-dimensional point.
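A minimal sketch in SAS/IML (the small matrix is made up): CALL SVD returns U, the vector of singular values Q and V, and keeping only the k largest singular values gives the rank-k approximation Ak:

proc iml;
   A = {1 2 3, 4 5 6, 7 8 9, 1 0 1, 2 1 0};   /* toy 5 x 3 matrix          */
   call svd(U, Q, V, A);                      /* A = U * diag(Q) * V`      */
   k = 2;                                     /* keep 2 singular values    */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;  /* best rank-2 approximation */
   print Q Ak;
quit;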

SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

Original: 2448 × 3264 ≈ 8 mln numbers

SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

SVD: 15 largest SVs ≈ 1% of the data

SVD EXAMPLE: USING MY SON AS AN EXPERIMENT

SVD: 75 largest SVs ≈ 5% of the data

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection: I have 500 inputs, but maybe there are only ten clusters of inputs. Within one cluster the variables are (strongly) correlated. Then use only one input per cluster for predictive modeling, as in the sketch below.

X1, X2, X3, ..., X500
Cluster {X1, X21, X35, X430, ...}   → representative X35
Cluster {X17, X29, X353, X490, ...} → representative X29
Cluster {X37, X95, X251, X393, ...} → representative X251
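A minimal sketch in SAS (hypothetical dataset): PROC VARCLUS builds such clusters of correlated inputs; within each cluster the variable with the lowest 1-R² ratio in the output is a natural representative to keep for modeling:

proc varclus data=mydata maxeigen=0.7 short;  /* split until 2nd eigenvalue < 0.7 */
   var x1-x500;
run;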

VARIABLE CLUSTERING TO REDUCE THE DIMENSION


VARIABLE CLUSTERING TO REDUCE THE DIMENSION


BAGGING & BOOSTING

COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough: let multiple models vote for a prediction.

Bootstrap Aggregation (Bagging): draw random samples from the data, fit a model on each sample, and combine the models' votes into a final model.

This only makes sense if the underlying models are different enough and have some predictive power.

BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees.

Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Randomly choose m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
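A minimal sketch in SAS (hypothetical dataset, target and inputs) using PROC HPFOREST, which implements these steps:

proc hpforest data=mydata maxtrees=500   /* number of bootstrap trees         */
              vars_to_try=10;            /* m inputs considered at each split */
   target default / level=binary;
   input x1-x100 / level=interval;
run;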

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

A decision tree and a random forest (100 sub-trees) fitted on the simulated data.

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

It is clear that the forest produces much smoother predictions.

GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting: M iterations, m = 1, 2, ..., M.

At each successive iteration a base learner hm (which is a decision tree) is fit on the pseudo-residuals, using the inputs x, to "correct" the previous learner. Pseudo-residuals r1, r2, ..., rM are computed at each step, and the model is updated as

Fm = Fm-1 + γ·hm

The final model is FM.
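Written out (a standard formulation following Friedman; the slide's formula images are missing), with loss function L and step size γ:

r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}, \qquad F_m(x) = F_{m-1}(x) + \gamma\, h_m(x)

where h_m is the tree fitted to the pseudo-residuals \{(x_i, r_{im})\}.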

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 37: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

P1

P2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

The Math behind

P = X W[ 119901 11 11990121

1199011119899 1199012119899

]=[ 11990911 11990921

1199091119899 1199092119899

] [11990811 119908 2111990812 119908 22]

w11 and w12 are the loadings corresponding to the first principle component

w21 and w22 are the loadings corresponding to the second principle component

With two dimensions In general

It turns out that the columns of W Are the eigenvalue vectors of the matrix XTX

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = f(w1·1 + w2·X2 + w3·X3 + w4·X4)

[Diagram: a constant input 1 and inputs X2, X3, X4 feed a single compute node via the weights w1, w2, w3, w4]

Neural network compute node: f is the so-called activation function. This could be the logit function, but other choices are possible.

There are four weights w's that have to be determined.
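
With the identity as activation function f, this compute node is exactly ordinary linear regression, which in SAS is simply (a sketch; the dataset name is hypothetical):

proc reg data=train;
   model y = x2 x3 x4;   /* fits w1 (intercept) + w2*X2 + w3*X3 + w4*X4 */
run;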

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS: MATHEMATICAL FORMULATION

In formula, the prediction formula for a NN is given by the expressions below.

[Network diagram: inputs X1 = Leeftijd (age), X2 = Inkomen (income), X3 = Regio (region), X4 = Geslacht (gender); hidden layer Z1, Z2, Z3 reached via weights α1, …; output Y via weights β1, …]

X: inputs | hidden layer: Z | output: Y

The functions g and σ are defined below, including the case of a binary classifier.

The model weights α and β have to be estimated from the data.
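
The formula images are lost in the transcript; the standard single-hidden-layer formulation (as in The Elements of Statistical Learning, which this slide follows) is:

$$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \dots, M$$

$$f(X) = g(\beta_0 + \beta^T Z), \quad \text{with } \sigma(v) = \frac{1}{1+e^{-v}}$$

In case of a binary classifier, g is also the logistic function, so the output can be read as a probability; for regression g is the identity.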

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm:

Randomly choose small values for all wi's. Then, for each data point (observation):

1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w according to the gradient step below.
4. Stop if the error E is small enough.
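
The update formula in step 3 is an image in the deck; the usual gradient-descent step, with a learning rate η (the standard choice), is:

$$w_{new} = w_{old} - \eta \, \frac{\partial E}{\partial w}$$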

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

[Diagram: inputs X1, X2, X3, X4 → ENCODE → small middle layer → DECODE → outputs X1, X2, X3, X4]

A linear activation function corresponds with 2-dimensional principal components analysis.

A 2-dimensional middle layer can be used for visualisation.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2-dimensional PCA vs. autoencoder network 25-15-2-15-25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR              */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6   */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED         */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING, WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two-dimensional representation of the 400-dimensional 'digit' data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from the training data for each node.

• Random variables are typically binary or discrete.

• The graph structure can be learned from the data.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

"Advanced" word counting

Parse & Filter:
• Part of speech
• Entity detection
• Mixed / numeric / abbrev.
• Stemming
• Spell checks
• Stop list
• Synonym list
• Multi-term words

Apply:
• Traditional data mining
• Clustering
• Prediction / machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" [I walk down the street in Amsterdam 1057DK with my bike]
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw" [She did not walk but cycled with her blue bike; note the misspelled "fieets" and the URL]
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" [My two-wheeler is broken, what a bad piece of iron $$]

Terms                      | Doc 1 | Doc 2 | Doc 3
+Fiets (znmw)              |   1   |   1   |   1
Fietsen (ww)               |   0   |   1   |   0
Blauwe (bvg)               |   0   |   1   |   0
Amsterdam (locatie)        |   1   |   0   |   0
+Lopen (ww)                |   1   |   1   |   0
Straat (znmw)              |   1   |   0   |   0
Kapot (bijw)               |   0   |   0   |   1
Slecht                     |   0   |   0   |   1
Stuk Ijzer                 |   0   |   0   |   1
1057DK (postcode)          |   1   |   0   |   0
bitlycomsdrtw (Internet)   |   0   |   1   |   0

TERM-DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros).
• Apply further mining on this matrix A.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.

Matrix SVD decomposition: A = U Σ Vᵀ, with Σ diagonal with the r singular values [r could be many thousands].

Take only the first k << r singular values: Aₖ = Uₖ Σₖ Vₖᵀ.
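
As a small sketch of the truncation (not in the deck; the toy matrix is hypothetical), SAS/IML computes an SVD directly:

proc iml;
   /* toy term-document matrix: 4 terms x 3 documents */
   A = {1 1 1,
        0 1 0,
        1 0 0,
        0 0 1};
   call svd(U, Q, V, A);                      /* A = U*diag(Q)*V`          */
   k = 2;                                     /* keep k << r singular values */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;  /* rank-k approximation of A */
   print Q, Ak;
quit;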

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud).

Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE: Which product should I recommend to my customers?

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items gives ~0.01% filled.

User-Item Matrix (data):

           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings:                      -      -      1      2      5

After some math… the predicted ratings for User 4 are:   3.21   4.82   1      2      5

Recommend item 2!

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

y = x + b, with slope equal to 1


Item-item based

Weight w_ij: the number of users having rated both items i and j. Rating r_uj: the average rating computed from item j.

Sample rating database:

Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        ?
Lucy         ?        2        5
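
To make Slope One concrete, a worked prediction from this table (the arithmetic is added here, following the standard example): predict Lucy's rating for item A.
• Average difference A − B over the users who rated both (John and Mark): ((5−3) + (3−4)) / 2 = 0.5, so starting from her B rating: 2 + 0.5 = 2.5, with weight w = 2.
• Average difference A − C (John only): 5 − 2 = 3, so starting from her C rating: 5 + 3 = 8, with weight w = 1.
• Weighted prediction: (2 × 2.5 + 1 × 8) / 3 ≈ 4.33.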

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS: PEARSON CORRELATION

a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b.
Possible similarity values between −1 and 1.

$$sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2}\ \sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}$$

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

Factor the m × n user-item rating matrix R into R ≈ U V, with U of size m × k and V of size k × n.

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: $\hat{R}_{ij} = U_i^T V_j$

Minimize the prediction error:

$$\min_{U,V} \sum_{i,j} \left( R_{ij} - U_i^T V_j \right)^2 + \lambda \left( \lVert U_i \rVert^2 + \lVert V_j \rVert^2 \right)$$
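
A minimal sketch of the ALS idea in SAS/IML, assuming a fully observed toy matrix (a real recommender restricts the sums to the observed ratings); all names and numbers are illustrative:

proc iml;
   R = {5 3 2,
        3 4 2,
        1 2 5,
        2 1 4};                 /* toy 4 users x 3 items rating matrix */
   k = 2;                       /* number of hidden factors            */
   lambda = 0.2;                /* L2 regularization strength          */
   call randseed(123);
   U = j(nrow(R), k);  call randgen(U, "Uniform");
   V = j(ncol(R), k);  call randgen(V, "Uniform");
   /* alternating least squares: each update is a ridge regression */
   do iter = 1 to 100;
      U = R  * V * inv(V` * V + lambda * I(k));
      V = R` * U * inv(U` * U + lambda * I(k));
   end;
   Rhat = U * V`;               /* predicted ratings */
   print Rhat;
quit;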

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

User/item profile

User/item rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X,Y) = (# transactions with X and Y) / (total # transactions)

Lift = Support(X,Y) / ( Support(X) · Support(Y) )

Support & lift examples: {Diapers, Beer}: 0.8; {Diapers, Candles}: 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
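
A quick numeric check (numbers invented for illustration): if 40% of transactions contain X, 10% contain Y, and 8% contain both, then Support(X,Y) = 0.08 and Lift = 0.08 / (0.4 × 0.1) = 2.0, so buyers of X are twice as likely to buy Y as average.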

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = svd
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD ARM /
      label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = svd
      Num = 3
      users = ("Longhow Lam");
run;
QUIT;

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO!)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces…

So: can a machine read, see and hear?

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R² linear regression = 0.5; R² neural net = 0.6

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R² linear regression = 0.5; R² neural net = 0.6

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SAS: MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split

Models tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels.

Our best model predicted the label for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here.

Red numbers are predicted labels. We obviously see some mistakes…

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio players: the spoken digits '1' and '2']

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain.

Still too much: apply principal components.

TRAIN DATA

8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files
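
A minimal sketch of this pipeline (not the deck's actual code; dataset and variable names are hypothetical): PROC SPECTRA (SAS/ETS) produces the periodogram per WAV file, and PROC PRINCOMP reduces the spectra:

/* one WAV file read into a dataset with one amplitude column */
proc spectra data=wav1 out=spec1 p;   /* p = output the periodogram */
   var amplitude;
run;

/* with the periodograms of all 16 files transposed to one row each */
proc princomp data=allSpectra out=pcs n=5;
   var p_1-p_1000;                    /* hypothetical frequency columns */
run;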

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on the training data.

Zero errors on the test data: also 8 'ones' and 8 'twos'.

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbour
• Strange faces? proc neural / autoencoder

Create an R script to: retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: LOOK-ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS Faces Actors Faces

Read more on my blog

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 39: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCADimension reductionVisualisation

Outlier anomalie detectie

PCA regression Use PC instead of the original inputs

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined


NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula, the prediction formula for a NN is given by:

f(X) = g( β0 + Σm βm·Zm ),  where  Zm = σ( α0m + αmᵀX )

[Diagram: inputs X1...X4 (age, income, region, gender) feed a hidden layer Z1, Z2, Z3 through weights α; the hidden layer feeds the output Y through weights β.]

The functions g and σ are typically defined as the identity g(T) = T (for regression) and the sigmoid σ(v) = 1 / (1 + e^(−v)).

In case of a binary classifier, g is taken to be the logistic function g(T) = e^T / (1 + e^T), so the output is a probability.

The model weights α and β have to be estimated from the data.


NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wi's. Then, for each data point (observation):

1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w according to w ← w − λ·∂E/∂w (with λ a small learning rate).

4. Stop if the error E is small enough.
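For the single compute node of the earlier slide (prediction ŷ = f(wᵀx), error E = (y − ŷ)²), the adjustment in step 3 follows from the chain rule:

\frac{\partial E}{\partial w_j} = -2\,(y - \hat{y})\, f'(w^T x)\, x_j, \qquad w_j \leftarrow w_j - \lambda \frac{\partial E}{\partial w_j}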


DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS


NEURAL NETS AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use inputs to predict the inputs

[Diagram: inputs X1, X2, X3, X4 → ENCODE → small middle layer → DECODE → outputs X1, X2, X3, X4.]

A linear activation function corresponds with 2-dimensional principal components analysis.

A 2-dimensional middle layer can be used for visualisation.


NEURAL NETS AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes

[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT.]


NEURAL NET CARS EXAMPLE

[Two panels: 2-dimensional PCA versus an autoencoder network 25 – 15 – 2 – 15 – 25.]


NEURAL NETS AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, ..., x400)
• Compare two-dimensional PCA with a neural net autoencoder


NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR            */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED       */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1 - pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;


Two-dimensional representation of the 400-dimensional 'digit' data.


BAYESIAN NETWORKS


BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from training data for each node.

• Random variables are typically binary or discrete.

• The graph structure can be learned from the data.
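Together, the graph and its tables encode a factorization of the joint distribution:

P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\big(X_i \mid \mathrm{parents}(X_i)\big)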


TEXT MINING


TEXT MINING BASICS

"Advanced" word counting

Parse & Filter
• Part of speech
• Entity detection
• Mixed / numeric / abbrev.
• Stemming
• Spell checks
• Stop list
• Synonym list
• Multi-term words

Apply
• Traditional data mining
• Clustering
• Prediction, machine learning


TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" [I walk down the street in Amsterdam 1057DK with my bike]
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw" [She did not walk but cycled with her blue (misspelled) bike]
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" [My two-wheeler is broken, what a bad piece of iron $$]

Terms                        Doc 1  Doc 2  Doc 3
+Fiets (znmw)                  1      1      1
Fietsen (ww)                   0      1      0
Blauwe (bvg)                   0      1      0
Amsterdam (locatie)            1      0      0
+Lopen (ww)                    1      1      0
Straat (znmw)                  1      0      0
Kapot (bijw)                   0      0      1
Slecht                         0      0      1
Stuk Ijzer                     0      0      1
1057DK (postcode)              1      0      0
bitlycomsdrtw (Internet)       0      1      0

TERM DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros).
• Apply further mining on this matrix A.


TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:

• Often more terms than documents

• Rows could be strongly correlated

• Matrix is often very sparse

Apply a singular value decomposition first.


TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.

Matrix SVD decomposition:  A = U Σ Vᵀ

Σ is diagonal with r singular values [could be many thousands].

Take only the first k ≪ r singular values:  A_k = U_k Σ_k V_kᵀ


TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[Diagram: documents grouped into Topic 1, Topic 2, Topic 3.]


RECOMMENDATION ENGINE: Which product should I recommend to my customers?


RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives roughly 0.01% filled cells.

User - Item Matrix - Data

           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings:                          -     -     1     2     5
After some math... the predicted ratings are:  3.21  4.82   1     2     5

Recommend item 2!


RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms
• Slope one (slope1)
• K nearest neighbors (knn)

Model-based algorithms
• Matrix factorization (SVD - LBFGS)

Market basket analysis
• Association rules mining (arm)

Mixture of different methods
• Clustering (cluster)
• Ensemble


RE METHODS SLOPE ONE

y = x + b, with slope equal to 1.

Item-item based:

Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the average rating computed from item j.

Sample rating database:

Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        ?
Lucy         ?        2        5
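A worked prediction from this sample table (assuming the '?' placement shown above; weighted slope one): estimate Lucy's rating for item A.

dev(A,B) = ((5−3) + (3−4)) / 2 = 0.5, with weight w_AB = 2 (two users rated both A and B)
dev(A,C) = (5−2) / 1 = 3, with weight w_AC = 1

r̂(Lucy, A) = [ w_AB · (r_Lucy,B + dev(A,B)) + w_AC · (r_Lucy,C + dev(A,C)) ] / (w_AB + w_AC)
            = [ 2 · (2 + 0.5) + 1 · (5 + 3) ] / 3 ≈ 4.33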


RE METHODS K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Diagram: similarity w between users; neighborhood N of the k most similar users.]


RE METHODS

PEARSON CORRELATION

a, b : users
r_{a,p} : rating of user a for item p
P : set of items rated both by a and b
• Possible similarity values between −1 and 1

sim(a, b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2} \; \sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}
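A tiny worked example with made-up ratings: users a and b both rated two items, r_a = (4, 2) and r_b = (5, 1), so r̄_a = 3 and r̄_b = 3. Then

sim(a, b) = ( (1)(2) + (−1)(−2) ) / ( √(1² + 1²) · √(2² + 2²) ) = 4 / 4 = 1,

a perfect positive correlation.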


RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

[Diagram: R (m × n, users × items) ≈ U (m × k) times V (k × n).]

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating:  \hat{R}_{ij} = U_i^T V_j

Minimize the prediction error:  \min_{U,V} \sum_{i,j} (R_{ij} - U_i^T V_j)^2 + \lambda(\|U_i\|^2 + \|V_j\|^2)
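For the L-BFGS route one needs the gradients of this loss; for a single observed cell (i, j) they are (a standard derivation):

\frac{\partial L}{\partial U_i} = -2\,(R_{ij} - U_i^T V_j)\,V_j + 2\lambda U_i, \qquad \frac{\partial L}{\partial V_j} = -2\,(R_{ij} - U_i^T V_j)\,U_i + 2\lambda V_j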


RE METHODS CLUSTER

k-NN within one subgroup.

[Flow: user/item profiles and user/item ratings → clustering → k-NN within each subgroup → predictions.]


RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X → Y) = (# trxs with X and Y) / (total # trxs)

Lift(X → Y) = Support(X, Y) / ( Support(X) · Support(Y) )

Support & lift examples: Diapers → Beer 0.8; Diapers → Candles 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
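A quick worked example with made-up numbers: if 20% of the transactions contain X, 16% contain Y, and 8% contain both, then Lift = 0.08 / (0.20 · 0.16) = 2.5.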


METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance


PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      Num = 3
      users = ("Longhow Lam");
run;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS
• Often less data prep (manual tuning) necessary (just throw it in the algorithm...)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces...

So: can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.


USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

[Scatter plot: predicted review score vs. given review score.]

R² linear regression = 0.5
R² neural net = 0.6


IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

[Image: the first 100 digits of the MNIST data and their KNOWN labels in red.]


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split. Models tried:

• PCA regression on the 50 largest PC's

• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons

• Seven multi-layer neural nets

• Three random forests: 100, 500 and 1000 trees

• 8, 16 and 24 nearest neighbors


MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels.

Our best model predicted the label for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here.

Red numbers are predicted labels. We see some obvious mistakes...


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Embedded audio players: recordings of the spoken digits 1 and 2.]


SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain.

Still too much: apply principal components.

TRAIN DATA

8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.


SPEECH RECOGNITION


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data.

Zero errors on the test data (also 8 'ones' and 8 'twos').


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues...


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• retrieve the SAS faces from our site,
• put them through the Face++ API,
• collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder


STRANGE FACE DETECTION: LOOK-ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION: STRANGE FACES

[Images: SAS faces, actors' faces.]

Read more on my blog.


STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 40: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PRINCIPLE COMPONENTS DIMENSION REDUCTION

P = X WNow only take the first L columns of W

PL = X WL

For example for visualization only use the first 2 or 3 columns so that PL only has 2 or 3 columns that can be visualized in scatter or contour plots

XW

P=

XWLPL

=

(10000 by 100 ) (100 by 100)(10000 by 100 )

(10000 by100 ) (100 by2)(10000 by 2)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 41: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2-dimensional PCA vs. autoencoder network 25 – 15 – 2 – 15 – 25

NEURAL NETS AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder

NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data=autoencoderTraining
   dmdbcat=work.autoencoderTrainingCat;
   performance compile details cpucount=12 threads=yes;
   /* DEFAULTS: ACT=TANH COMBINE=LINEAR               */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED       */
   archi MLP hidden=5;
   hidden 300 / id=h1;
   hidden 100 / id=h2;
   hidden 2   / id=h3 act=linear;
   hidden 100 / id=h4;
   hidden 300 / id=h5;
   input corruptedPixel1 - corruptedPixel400 / id=i level=int std=std;
   target pixel1 - pixel400 / act=identity id=t level=int std=std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random=123;
   prelim 10 preiter=10;
run;


Two-dimensional representation of the 400-dimensional 'digit' data.

BAYESIAN NETWORKS


BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data

TEXT MINING


TEXT MINING BASICS

"Advanced" word counting.

Parse & Filter: part of speech, entity detection, mixed/numeric/abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Apply: traditional data mining — clustering, prediction, machine learning.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $%^&@$"
(Dutch example sentences; left untranslated because the term matrix below is built from them)

Terms                      Doc 1  Doc 2  Doc 3
+Fiets (znmw)                1      1      1
Fietsen (ww)                 0      1      0
Blauwe (bvg)                 0      1      0
Amsterdam (locatie)          1      0      0
+Lopen (ww)                  1      1      0
Straat (znmw)                1      0      0
Kapot (bijw)                 0      0      1
Slecht                       0      0      1
Stuk Ijzer                   0      0      1
1057DK (postcode)            1      0      0
bitlycomsdrtw (Internet)     0      1      0

TERM DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

• Often more terms than documents
• Rows could be strongly correlated
• Matrix is often very sparse

Apply Singular value decomposition first
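As a sketch of the idea in SAS/IML (the small count matrix is made up for illustration):

proc iml;
   /* toy term-document matrix: 5 documents x 4 terms (hypothetical) */
   A = {1 0 2 0,
        0 1 0 0,
        3 0 1 1,
        0 0 0 2,
        1 1 0 0};
   call svd(U, Q, V, A);                    /* A = U*diag(Q)*V`           */
   k = 2;                                   /* keep k << r singular values */
   Ak = U[,1:k] * diag(Q[1:k]) * V[,1:k]`;  /* rank-k approximation        */
   docScores = U[,1:k] # Q[1:k]`;           /* k-dim document coordinates  */
   print Q, Ak;
quit;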


TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.

Matrix SVD decomposition:

\[
A = U\,\Sigma\,V^{T}
\]

with Σ diagonal, holding the r singular values (r could be many thousands). Take only the first k ≪ r singular values:

\[
A_k = U_k\,\Sigma_k\,V_k^{T}
\]


TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[Figure: documents grouped into Topic 1, Topic 2, Topic 3]

RECOMMENDATION ENGINE: Which product should I recommend to my customers?

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items → only ~ 0.01% of the cells are filled.

User - Item Matrix – Data

           Item 1   Item 2   Item 3   Item 4   Item 5
User 1        3        2        5        4        5
User 2        -        -        -        1        1
User 3        1        -        2        5        -
User 4        -        -        1        2        5
User 5        2        1        4        2        3
User 6        2        3        -        5        1
User 7        5        1        -        3        4
User 8        -        1        -        4        1
User 9        2        3        2        4        2
User 10       -        1        3        -        1

User 4's item ratings:             -      -      1      2      5
After some math… the predictions: 3.21   4.82    1      2      5

Recommend item 2.

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)

Model-based algorithms: matrix factorization (SVD – LBFGS)

Market basket analysis: association rules mining (arm)

Mixture of different methods: clustering (cluster), ensemble


RE METHODS SLOPE ONE

Item-item based: y = x + b, with the slope equal to 1.

Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the average rating computed from item j.

Sample rating database

Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4        -
Lucy          -        2        5
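A worked prediction under the usual weighted slope-one scheme (assuming, as in the table above, that Mark rated items A and B and Lucy items B and C): the average difference A − B is ((5−3) + (3−4)) / 2 = 0.5 (based on 2 users) and A − C is (5−2) / 1 = 3 (1 user). Lucy's predicted rating for A is then the weighted average (2·(2 + 0.5) + 1·(5 + 3)) / (2 + 1) = 13/3 ≈ 4.33.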


RE METHODS K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Figure: the neighborhood N around the user-item cell (u,i), with similarity weights w]
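For instance, Pearson similarities between users can be computed with PROC CORR on an items-by-users table (data set and variable names are hypothetical):

proc corr data=ratings_wide pearson outp=user_sims noprint;
   /* ratings_wide: one row per item, one rating column per user */
   var user1-user10;
run;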


RE METHODS

PEARSON CORRELATION

• a, b: users
• r_{a,p}: rating of user a for item p
• P: set of items rated both by a and b
• Possible similarity values between −1 and 1

\[
sim(a,b) \;=\; \frac{\displaystyle\sum_{p \in P} (r_{a,p} - \bar r_a)\,(r_{b,p} - \bar r_b)}
{\sqrt{\displaystyle\sum_{p \in P} (r_{a,p} - \bar r_a)^2}\;\sqrt{\displaystyle\sum_{p \in P} (r_{b,p} - \bar r_b)^2}}
\]


RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

[Figure: the m × n rating matrix R (users × items) is factorized as R ≈ U·V, with U an m × k and V a k × n matrix]

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating:

\[
\hat R_{ij} = U_i^{T} V_j
\]

Minimize the prediction error:

\[
\min_{U,V}\; \sum_{i,j} \big(R_{ij} - U_i^{T} V_j\big)^2 \;+\; \lambda \big(\lVert U_i \rVert^2 + \lVert V_j \rVert^2\big)
\]
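A toy version of this factorization in SAS/IML, using plain stochastic gradient descent rather than the L-BFGS/ALS solvers named above (the matrix, k, λ and the step size are made up):

proc iml;
   /* tiny rating matrix; 0 denotes a missing rating (assumption) */
   R = {5 3 0,
        0 4 1,
        2 0 5};
   k = 2;  lambda = 0.2;  gamma = 0.05;
   call randseed(123);
   U = j(nrow(R), k);  call randgen(U, "Uniform");
   V = j(ncol(R), k);  call randgen(V, "Uniform");
   do iter = 1 to 500;                        /* SGD sweeps         */
      do i = 1 to nrow(R);
         do j = 1 to ncol(R);
            if R[i,j] > 0 then do;
               e = R[i,j] - U[i,] * V[j,]`;   /* prediction error   */
               U[i,] = U[i,] + gamma*(e*V[j,] - lambda*U[i,]);
               V[j,] = V[j,] + gamma*(e*U[i,] - lambda*V[j,]);
            end;
         end;
      end;
   end;
   Rhat = U * V`;                             /* filled-in ratings  */
   print Rhat;
quit;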


RE METHODS CLUSTER

[Figure: cluster the user/item profiles and ratings first, then apply knn within one subgroup to produce the predictions]

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

\[
\text{Support}(X \Rightarrow Y) = \frac{\#\,\text{trxs with } X \text{ and } Y}{\#\,\text{total trxs}},
\qquad
\text{Lift} = \frac{\text{Support}(X,Y)}{\text{Support}(X)\,\text{Support}(Y)}
\]

Support & Lift examples:
Diapers → Beer     0.8
Diapers → Candles  0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
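A quick sanity check with hypothetical numbers: if Support(X) = Support(Y) = 0.4 and Support(X,Y) = 0.4 (X and Y always occur together), then Lift = 0.4 / (0.4 × 0.4) = 2.5 — the pair occurs 2.5 times as often as independence would predict.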


METHOD ENSEMBLE

A linear combination of the previous methods, to achieve better performance.

PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD (L-BFGS) with 20 factors */
   METHOD svd /
      factors   = 20
      label     = "svd"
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      MAXFEVAL  = 5000
      function  = L2
      lamda     = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm / label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label  = "svd"
      Num    = 3
      users  = ("Longhow Lam");
run;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

[Figure: predicted review score vs. given review score]

R² linear regression = 0.5; R² neural net = 0.6


IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

The first 100 digits of the MNIST data and their KNOWN labels in red.


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split. Candidate models:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees (a sketch follows below)
• 8, 16 and 24 nearest neighbours
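As an illustration of one of these candidates, a random forest on the pixel inputs can be specified with PROC HPFOREST; a sketch, with the data set name and settings assumed rather than taken from the experiment:

proc hpforest data=mnist_train maxtrees=500;
   input pixel1-pixel784 / level=interval;   /* the 784 pixel inputs */
   target label / level=nominal;             /* the digit 0-9        */
run;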


MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here. Red numbers are predicted labels; we see some obvious mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio players: recordings of the spoken digits 1 and 2]

SPEECH RECOGNITION

WAV files consist of ~ 30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain.

Still too much: apply principal components.

TRAIN DATA

8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
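A sketch of this preprocessing pipeline in SAS (data set and variable names are hypothetical; PROC SPECTRA produces the periodogram, PROC PRINCOMP the principal components):

/* periodogram of one recording; signal holds the sampled waveform */
proc spectra data=wav_one p out=spec_one;
   var signal;
run;

/* after transposing to one row per recording (freq1-freq512 are */
/* hypothetical column names), reduce the dimension with PCA     */
proc princomp data=all_spectra out=pcs n=10;
   var freq1-freq512;
run;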


SPEECH RECOGNITION


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to: retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? → proc cluster (hierarchical clustering)
• Sales faces? → predictive modeling / machine learning
• Who is the Brad Pitt? → nearest neighbour
• Strange faces? → proc neural auto-encoder


STRANGE FACE DETECTION LOOK ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 42: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SINGULAR VALUE DECOMPOSITION

A datapoint d can now be represented by k dimensional point

Matrix SVD decomposition

Diagonal with r singular values [ could be a large number]

UAVT

Σ

Take only k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 43: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original2448 X 3264 ~ 8 mln numbers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A


TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

• Often more terms than documents
• Rows could be strongly correlated
• Matrix is often very sparse

Apply Singular value decomposition first


TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.

Matrix SVD decomposition:

   A = U Σ Vᵀ

where Σ is diagonal with r singular values [could be many thousands].

Take only the first k << r singular values:

   A ≈ A_k = U_k Σ_k V_kᵀ
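To make the decomposition concrete, here is a minimal SAS/IML sketch (assuming SAS/IML is licensed; the 5 x 3 matrix below is a toy term-document matrix, not real data):

proc iml;
   /* toy term-document matrix: 5 terms x 3 documents */
   A = {1 1 1,
        0 1 0,
        1 0 0,
        1 1 0,
        0 0 1};
   call svd(U, Q, V, A);                      /* A = U * diag(Q) * V` */
   k = 2;                                     /* keep the k largest singular values */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;  /* rank-k approximation of A */
   docSpace = V[, 1:k];                       /* each document as a k-dimensional vector */
   print Q docSpace;
quit;

Further mining is then applied to the short document vectors in docSpace instead of the raw counts.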


TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collections. Apply clustering techniques to classify documents into clusters (topics).

[Diagram: documents grouped into Topic 1, Topic 2, Topic 3]


RECOMMENDATION ENGINE: Which product should I recommend to my customers?


RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rate items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items → ~0.01% filled.

User-Item Matrix (data):
            Item 1   Item 2   Item 3   Item 4   Item 5
User 1        3        2        5        4        5
User 2        -        -        -        1        1
User 3        1        -        2        5        -
User 4        -        -        1        2        5
User 5        2        1        4        2        3
User 6        2        3        -        5        1
User 7        5        1        -        3        4
User 8        -        1        -        4        1
User 9        2        3        2        4        2
User 10       -        1        3        -        1

User 4's item ratings:                        -      -      1      2      5

After some math… User 4's predicted ratings:  3.21   4.82   1      2      5

Recommend item 2!


RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)

Model-based algorithms: matrix factorization (SVD-LBFGS)

Market basket analysis: association rules mining (arm)

Mixture of different methods: clustering (cluster), ensemble


RE METHODS SLOPE ONE

y = x + b, a line with slope equal to 1


Item-item based

Weight w_ij: the number of users having rated both items i and j. Rating r_uj: the average rating computed from item j.

Sample rating database:
Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4        -
Lucy          -        2        5
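Reading the table this way (Mark did not rate item C, Lucy did not rate item A), a worked slope-one prediction of Lucy's rating for A: on average, A is rated ((5 - 3) + (3 - 4)) / 2 = 0.5 higher than B (two users co-rated A and B) and (5 - 2) / 1 = 3 higher than C (one user). Lucy's predicted rating for A is then the weighted average ((2 + 0.5) · 2 + (5 + 3) · 1) / (2 + 1) ≈ 4.33, where the weights are the co-rating counts.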


RE METHODS K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

In its common form the prediction is a weighted average over the neighbors N, with similarity weights w: r̂_ui = Σ_{v∈N} w_uv · r_vi / Σ_{v∈N} |w_uv|.


RE METHODS: PEARSON CORRELATION

• a, b: users
• r_{a,p}: rating of user a for item p
• P: set of items rated both by a and b
• Possible similarity values between -1 and 1

sim(a, b) = Σ_{p∈P} (r_{a,p} - r̄_a)(r_{b,p} - r̄_b) / ( √( Σ_{p∈P} (r_{a,p} - r̄_a)² ) · √( Σ_{p∈P} (r_{b,p} - r̄_b)² ) )
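A minimal SAS/IML sketch of this similarity (toy ratings; assumes users a and b co-rated the same five items in P):

proc iml;
   ra = {5, 3, 2, 4, 1};      /* ratings of user a on the co-rated items P */
   rb = {4, 3, 1, 5, 2};      /* ratings of user b on the same items       */
   da = ra - mean(ra);        /* center on each user's mean rating         */
   db = rb - mean(rb);
   sim = (da` * db) / (sqrt(da` * da) * sqrt(db` * db));   /* Pearson similarity */
   print sim;
quit;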


RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

R (m × n) ≈ U (m × k) · V (k × n)    [rows = users, columns = items]

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating:  R̂_ij = U_iᵀ V_j

Minimize the prediction error:  min_{U,V} Σ_{i,j} ( R_ij - U_iᵀ V_j )² + λ ( ‖U_i‖² + ‖V_j‖² )
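To make the update rule concrete, here is a sketch with plain stochastic gradient descent in SAS/IML (toy 5 x 4 rating matrix with "." for unknown ratings; this only illustrates the minimization, PROC RECOMMEND's SVD-LBFGS method shown later is the production route):

proc iml;
   R = {5 3 . 1,
        4 . . 1,
        1 1 . 5,
        1 . . 4,
        . 1 5 4};                        /* "." = unknown rating */
   k = 2;  eta = 0.01;  lambda = 0.1;    /* hidden factors, step size, penalty */
   call randseed(123);
   U = j(nrow(R), k);  call randgen(U, "Uniform");
   V = j(k, ncol(R));  call randgen(V, "Uniform");
   do iter = 1 to 2000;                  /* SGD passes over the observed cells */
      do i = 1 to nrow(R);
         do j = 1 to ncol(R);
            if R[i, j] ^= . then do;
               e = R[i, j] - U[i, ] * V[, j];                     /* prediction error */
               U[i, ] = U[i, ] + eta * (e * V[, j]` - lambda * U[i, ]);
               V[, j] = V[, j] + eta * (e * U[i, ]` - lambda * V[, j]);
            end;
         end;
      end;
   end;
   Rhat = U * V;                         /* filled-in rating matrix */
   print Rhat[format = 6.2];
quit;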


RE METHODS CLUSTER

[Diagram: user/item profiles and user/item ratings → clustering → knn within one subgroup → predictions]


RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X,Y) = (# transactions with X and Y) / (total # transactions)

Lift(X → Y) = Support(X,Y) / ( Support(X) · Support(Y) )

Support & lift: Diapers → Beer: 0.8; Diapers → Candles: 0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
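A quick numeric illustration with hypothetical counts: in 1,000 transactions, suppose X appears in 200, Y in 250, and X and Y together in 100. Then Support(X,Y) = 100/1000 = 0.1, Support(X) = 0.2, Support(Y) = 0.25, and Lift = 0.1 / (0.2 · 0.25) = 2.0: customers with X are twice as likely as average to also have Y.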


METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance


PROC RECOMMEND recom = rs.IENS;
   /* add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* method SVD-LBFGS with 20 factors */
   METHOD svd /
      factors   = 20
      label     = "svd"
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      MAXFEVAL  = 5000
      function  = L2
      lamda     = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label  = "svd"
      Num    = 3
      users  = ("Longhow Lam");
run;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO!)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.


Predicted review score vs. given review score

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

R² linear regression = 0.5; R² neural net = 0.6


IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split. Techniques tried:
• PCA regression on the 50 largest PC's
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors


MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels.

Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here.

Red numbers are the predicted labels. We see obviously some mistakes…


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio players: spoken digits '1' and '2' recorded with iPhone]


SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain.

Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.
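As a sketch of the spectral step (assuming the WAV samples were already read into a data set SIGNAL with a variable AMPLITUDE; both names are hypothetical, and PROC SPECTRA is part of SAS/ETS):

proc spectra data=signal out=freqdom p s adjmean;
   var amplitude;   /* output the periodogram (p) and spectral density (s) */
run;

The frequency-domain output in FREQDOM is then reduced further with principal components before modeling.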


SPEECH RECOGNITION


SPEECH RECOGNITION

Zero errors on the training data.

Zero errors on the test data (also 8 'ones' and 8 'twos').

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Little joke on my colleagues…


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, and collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder


STRANGE FACE DETECTION: LOOK-ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION: STRANGE FACES

SAS faces | Actors' faces

Read more on my blog.


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS faces | Actors' faces

Read more on my blog.

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 44: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 15 largest SVrsquos1 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 45: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVD EXAMPLE USING MY SON AS AN EXPERIMENT

SVD 75 largest Vrsquos5 of the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

Approximate the m × n user-item rating matrix R by a product of two low-rank matrices:

   R ≈ U V,   U : m × k (users × hidden factors),   V : k × n (hidden factors × items)

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating:   R̂_ij = U_i^T V_j

Minimize the prediction error:

   min_{U,V} Σ_{i,j} ( R_ij − U_i^T V_j )² + λ ( ‖U_i‖² + ‖V_j‖² )
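A toy version of this optimization (a sketch with plain gradient descent; PROC RECOMMEND itself uses L-BFGS or ALS). The matrix R below reuses the first rows of the user-item example, with 0 marking a missing rating:

import numpy as np

rng = np.random.default_rng(0)
R = np.array([[3, 2, 5, 4, 5],
              [0, 0, 0, 1, 1],
              [1, 0, 2, 5, 0]], dtype=float)
mask = R > 0                        # observed entries
k, lam, lr = 2, 0.2, 0.01           # hidden factors, regularization, step size

U = 0.1 * rng.standard_normal((R.shape[0], k))
V = 0.1 * rng.standard_normal((k, R.shape[1]))

for _ in range(2000):
    E = mask * (R - U @ V)          # prediction error on observed entries only
    U += lr * (E @ V.T - lam * U)   # gradient step including the L2 penalty
    V += lr * (U.T @ E - lam * V)

print(np.round(U @ V, 2))           # R-hat: the filled-in rating matrix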

RE METHODS CLUSTER

Knn within one subgroup: first apply clustering on the user/item profiles and ratings, then compute the predictions with knn inside each cluster.

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:

   IF item A and B THEN item C
   IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

   Support(X,Y) = (# transactions with X and Y) / (total # transactions)

   Lift = Support(X,Y) / ( Support(X) · Support(Y) )

Example values: {Diapers, Beer}: 0.8; {Diapers, Candles}: 0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
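A tiny illustration (hypothetical transactions, added here) of computing support and lift for the rule diapers -> beer:

transactions = [{"diapers", "beer"}, {"diapers", "beer", "candles"},
                {"beer"}, {"diapers"}, {"bread", "beer"}]

def support(*items):
    s = set(items)
    # fraction of transactions containing all the given items
    return sum(s <= t for t in transactions) / len(transactions)

sup_xy = support("diapers", "beer")
lift = sup_xy / (support("diapers") * support("beer"))
print(sup_xy, round(lift, 2))   # 0.4 and 0.83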

METHOD ENSEMBLE

A linear combination of the previous methods, to achieve better performance.

/* Add a recommendation system */
PROC RECOMMEND recom = rs.IENS;
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD, L-BFGS with 20 factors */
   METHOD svd /
      factors = 20 label = svd fconv = 1e-3 gconv = 1e-3
      maxiter = 100 MAXFEVAL = 5000 function = L2
      lamda = 0.2 technique = lbfgs;
   RUN;

   METHOD arm /
      label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label  = svd
      Num    = 3
      users  = ("Longhow Lam");
RUN;
QUIT;

LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS
• Often less data prep (manual tuning) necessary (just throw it in the algorithm...)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform them into data points in SVD space.

Predicted review score vs. given review score

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

R² linear regression = 0.5
R² neural net = 0.6

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

Models tried, on a 70/30 training/validation split:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors (the winner is sketched below)
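For illustration, the winning 8-nearest-neighbour fit in scikit-learn (my sketch; the deck itself used SAS Enterprise Miner). Note that fetching MNIST downloads a sizable dataset and k-NN scoring is slow:

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 784-dimensional pixel vectors with known labels
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=0.7, random_state=1)

knn = KNeighborsClassifier(n_neighbors=8).fit(X_tr, y_tr)
print(1 - knn.score(X_val, y_val))   # validation misclassification rate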

MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we see some obvious mistakes...

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[embedded audio players: recordings of the spoken digits 1 and 2]

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA
8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files
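A rough Python sketch of that preprocessing chain (illustrative only; the wav file names are hypothetical):

import numpy as np
from scipy.io import wavfile
from scipy.signal import periodogram
from sklearn.decomposition import PCA

files = [f"one_{i}.wav" for i in range(8)] + [f"two_{i}.wav" for i in range(8)]
spectra = []
for f in files:
    rate, signal = wavfile.read(f)               # ~30,000 sample points per clip
    freqs, power = periodogram(signal, fs=rate)  # spectral analysis -> frequency domain
    spectra.append(power[:5000])                 # fixed-length slice of each spectrum

X = PCA(n_components=2).fit_transform(np.array(spectra))
print(X.shape)   # (16, 2): two principal components per recording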

SPEECH RECOGNITION

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Little joke on my colleagues...

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• Retrieve the SAS faces from our site and put them through the Face++ API
• Collect the JSON results and store them in an ABT

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbour
• Strange faces? proc neural auto-encoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-ALIKES

STRANGE FACE DETECTION STRANGE FACES

SAS faces vs. actors' faces

Read more on my blog.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 46: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variabele selection I have 500 inputs but maybe there are only ten clusters of inputs Within 1 cluster the variables are (strongly) correlated Then use only 1 input per cluster for predictive modeling

X1 X2 X3 hellip X500

X1 X21 X35 X430hellip X35

X17 X29 X353 X490hellip X29

X37 X95 X251 X393hellip X251

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 47: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data, and their KNOWN labels in red.

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbours

8-nearest neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA: APPLY THE MODEL ON THE TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are predicted labels; we see some obvious mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio clips: spoken digits 1 and 2]

SPEECH RECOGNITION

WAV files consist of ~ 30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
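A sketch of these two steps in SAS, assuming a hypothetical table WORK.WAV1 holding one sampled clip in a column amp (PROC SPECTRA is part of SAS/ETS):

/* periodogram / spectral density estimates of one clip */
proc spectra data=wav1 out=spec1 p s;
   var amp;
run;

/* after stacking the spectra of all clips row-wise in WORK.SPECTRA, */
/* reduce the dimension with principal components                    */
proc princomp data=spectra n=10 out=pcs;
   var f1-f100;   /* hypothetical frequency-bin columns */
run;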

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical cluster)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbour
• Strange faces? proc neural auto-encoder
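For the look-alike step, a minimal PROC CLUSTER sketch on the landmark ABT; WORK.FACES, its landmark columns x1-x83 and the face_id column are hypothetical names:

proc cluster data=faces method=ward outtree=tree;
   var x1-x83;      /* the facial-landmark features   */
   id face_id;      /* hypothetical face identifier   */
run;

proc tree data=tree ncl=10 out=clusters;  /* cut the dendrogram into 10 clusters */
run;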

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

SAS faces | Actors' faces

Read more on my blog.

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 48: Machine learning overview (with SAS software)

VARIABLE CLUSTERING TO REDUCE THE DIMENSION

BAGGING & BOOSTING

COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough, let multiple models vote for a prediction: Bootstrap Aggregation (Bagging). This only makes sense if the underlying models are different enough and have some predictive power.

[Diagram: data → random samples → individual models → final model]

BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees. Apply the underlying steps repeatedly:
1. Generate a bootstrap sample
2. Choose randomly m inputs, m << P
3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree: the random forest prediction is the majority vote of all trees.
In case of a regression tree: the random forest prediction is the average of all trees.
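A minimal sketch of fitting a forest in SAS with the high-performance forest procedure; the data set, variables and option values are invented, with maxtrees and vars_to_try mapping to the number of trees and to m in the steps above:

proc hpforest data=train maxtrees=100 vars_to_try=4;
   target y / level=binary;          /* classification target          */
   input x1-x10 / level=interval;    /* m = 4 inputs tried per split   */
run;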

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

A decision tree and a random forest (100 sub-trees) fitted on the simulated data. It is clear that the forest can produce much smoother predictions.

GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting, M iterations, m = 1, 2, …, M. Pseudo-residuals r_im are computed at each step. At each successive iteration a base learner h_m (which is a decision tree) is fit on the pseudo-residuals, using inputs x, to "correct" the previous learner:

F_m = F_{m-1} + γ·h_m

[Diagram: inputs x and pseudo-residuals r_1, r_2, …, r_M feeding successive base learners, ending in the final model F_M]

SUPPORT VECTOR MACHINES

SUPPORT VECTOR MACHINES (SVM)

Suppose we have a separable classification problem. Find a linear decision boundary between the two groups with maximum margin M: so the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, i.e. x², x³ or spline(x).

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".

SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model non-linear behaviour

Linearly not separable, but in 3D space they are: https://www.youtube.com/watch?v=3liCbRZPrZA

K – NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

Example: take the 5 nearest neighbours of x0; 3 of them are red, 2 of them are green, so we predict x0 to be red.
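A minimal k-NN classification sketch in SAS via the nonparametric option of PROC DISCRIM; the tables WORK.TRAIN / WORK.SCORE and the variables are hypothetical:

proc discrim data=train testdata=score testout=pred
             method=npar k=5;        /* k-NN with k = 5 */
   class y;                          /* the label       */
   var x1 x2;                        /* the inputs      */
run;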

K-NN METHOD

[Decision boundaries for 1 nearest neighbour vs. 15 nearest neighbours]

K-NN METHOD

Use different numbers k of nearest neighbours and compare test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns

K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.

Comparing different nearest neighbours in SAS Enterprise Miner.

K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were used; k = 5 nearest neighbours has the lowest average squared error.

NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4

[Diagram: input nodes 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding one compute node]

f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w's that have to be determined.

NEURAL NETWORKS: MATHEMATICAL FORMULATION

In formula form, the prediction formula for a NN with one hidden layer is given by:

Z_m = σ(α_0m + α_m^T · X),  m = 1, …, M
Y = f(X) = g(β_0 + β^T · Z)

[Diagram: inputs X1 age (leeftijd), X2 income (inkomen), X3 region (regio), X4 gender (geslacht); hidden layer Z1, Z2, Z3; output Y]

The functions g and σ are typically the logistic (sigmoid) function; in case of a binary classifier g is the logistic function 1 / (1 + e^−v). The model weights α and β have to be estimated from the data.

NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back-propagation algorithm:

Randomly choose small values for all w_i's. For each data point (observation):
1. Calculate the neural net prediction
2. Calculate the error E (for example E = (actual − prediction)²)
3. Adjust the weights w according to the gradient step w_new = w − η·∂E/∂w
4. Stop if the error E is small enough
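A toy version of this loop for one logistic neuron can be written in PROC IML; the four data points, the learning rate and the iteration count are invented for illustration:

proc iml;
   x = {1 2, 1 3, 1 5, 1 7};            /* column 1 = bias input      */
   y = {0, 0, 1, 1};
   w = {0.01, 0.01};                     /* small starting weights     */
   eta = 0.1;                            /* learning rate              */
   do iter = 1 to 500;
      p = 1 / (1 + exp(-(x * w)));       /* neuron prediction          */
      /* gradient of the squared error E = sum((y - p)##2) w.r.t. w    */
      grad = t(x) * ((p - y) # p # (1 - p));
      w = w - eta * grad;                /* adjust the weights         */
   end;
   print w;
quit;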

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: inputs X1-X4 → ENCODE → 2-dimensional middle layer → DECODE → outputs X1-X4]

A linear activation function corresponds with 2-dimensional principal components analysis; the 2-dimensional middle layer can be used for visualisation.

NEURAL NETS: AUTOENCODERS

Often more hidden layers, with many nodes.

[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]

NEURAL NET: CARS EXAMPLE

[Plots: 2-dimensional PCA vs. an autoencoder network 25 – 15 – 2 – 15 – 25]

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder (the PCA side is sketched below)
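The PCA half of that comparison is a one-liner; WORK.DIGITS with columns pixel1-pixel400 is a hypothetical name:

proc princomp data=digits n=2 out=scores;
   var pixel1-pixel400;
run;
/* the output table now holds Prin1 and Prin2: the 2-D representation */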

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data = autoencoderTraining
   dmdbcat = work.autoencoderTrainingCat;
   performance compile details cpucount = 12 threads = yes;
   /* DEFAULTS: ACT = TANH, COMBINE = LINEAR        */
   /* IDS ARE USED AS LAYER INDICATORS              */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED     */
   archi MLP hidden = 5;
   hidden 300 / id = h1;
   hidden 100 / id = h2;
   hidden 2   / id = h3 act = linear;
   hidden 100 / id = h4;
   hidden 300 / id = h5;
   input corruptedPixel1 - corruptedPixel400 / id = i level = int std = std;
   target pixel1 - pixel400 / act = identity id = t level = int std = std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random = 123;
   prelim 10 preiter = 10;
run;

Two-dimensional representation of the 400-dimensional 'digit' data.

BAYESIAN NETWORKS

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data

TEXT MINING

TEXT MINING BASICS

"Advanced" word counting:

Parse & filter: part of speech, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Then apply traditional data mining: clustering, prediction, machine learning.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam, 1057DK, met mijn fiets"
Document 2: "Zij liep niet, maar fietste met haar blauwe fieets, bitly.com/sdrtw"
Document 3: "Mijn tweewieler is kapot, wat een slecht stuk ijzer $$"

TERM DOCUMENT MATRIX A:

Terms                        Doc 1  Doc 2  Doc 3
+Fiets (znmw)                  1      1      1
Fietsen (ww)                   0      1      0
Blauwe (bvg)                   0      1      0
Amsterdam (locatie)            1      0      0
+Lopen (ww)                    1      1      0
Straat (znmw)                  1      0      0
Kapot (bijw)                   0      0      1
Slecht                         0      0      1
Stuk Ijzer                     0      0      1
1057DK (postcode)              1      0      0
bitly.com/sdrtw (Internet)     0      1      0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A

TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply singular value decomposition first.

TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U·Σ·V^T, with Σ diagonal with r singular values [could be many thousands].

Take only the first k << r singular values: A_k = U_k·Σ_k·V_k^T.

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.

TEXT MINING: APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (Topic 1, Topic 2, Topic 3).

RECOMMENDATION ENGINE: Which product should I recommend to my customers?

RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items, ~ 0.01% filled.

User-Item Matrix (data):

          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      3       2       5       4       5
User 2      -       -       -       1       1
User 3      1       -       2       5       -
User 4      -       -       1       2       5
User 5      2       1       4       2       3
User 6      2       3       -       5       1
User 7      5       1       -       3       4
User 8      -       1       -       4       1
User 9      2       3       2       4       2
User 10     -       1       3       -       1

User 4's item ratings:            -     -     1     2     5
After some math, they become:  3.21  4.82     1     2     5

Recommend item 2.

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble

RE METHODS: SLOPE ONE

y = x + b, with the slope equal to 1. Item-item based.

Weight w_ij: the number of users having rated both items i and j. Rating r_uj: the average rating computed from item j.

Sample rating database:

Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        -       2       5
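Worked through on this sample (the standard slope-one computation, added here as an illustration): the average difference A−B over the two users who rated both items is ((5−3) + (3−4)) / 2 = 0.5, and A−C over the one such user is 5−2 = 3. Lucy's predicted rating for item A is then the weight-combined estimate ((2 + 0.5)·2 + (5 + 3)·1) / 3 ≈ 4.33.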

RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood", i.e. a similarity-weighted average over the neighbors N:

r_ui = Σ_{j ∈ N} w_ij · r_uj / Σ_{j ∈ N} |w_ij|

How to determine the neighbors, and how many (k) to use? How to compute the similarity / distance measure w?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

RE METHODS: PEARSON CORRELATION

a, b: users; r_ap: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between −1 and 1.

sim(a, b) = Σ_{p∈P} (r_ap − r̄_a)(r_bp − r̄_b) / ( √(Σ_{p∈P} (r_ap − r̄_a)²) · √(Σ_{p∈P} (r_bp − r̄_b)²) )
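A quick sketch of this similarity for one user pair with PROC CORR, assuming a hypothetical table WORK.PAIR with one row per item rated by both users:

proc corr data=pair pearson;
   var rating_a rating_b;   /* ratings of user a and user b */
run;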

RE METHODS: K NEAREST NEIGHBORS METHOD

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

Factorize the m × n user-item rating matrix R as R ≈ U·V, with U (m × k) and V (k × n):
• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS / ALS

Predict a new rating: R̂_ij = U_i^T·V_j

Minimize the prediction error: min_{U,V} Σ_{i,j} (R_ij − U_i^T·V_j)² + λ·(‖U_i‖² + ‖V_j‖²)
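A rough PROC IML sketch of this optimization via stochastic gradient descent (a simpler stand-in for the L-BFGS / ALS solvers named above); the tiny rating matrix and all tuning values are invented:

proc iml;
   R = {3 2 5, . 1 2, 1 . 5};               /* tiny user-item matrix, . = missing */
   k = 2;  eta = 0.01;  lambda = 0.2;
   U = j(nrow(R), k);  V = j(ncol(R), k);   /* rows of V are item factors         */
   call randseed(123);
   call randgen(U, "Uniform");  call randgen(V, "Uniform");
   do iter = 1 to 2000;
      do i = 1 to nrow(R);
         do j = 1 to ncol(R);
            if R[i, j] ^= . then do;        /* only observed cells contribute     */
               e = R[i, j] - U[i, ] * t(V[j, ]);   /* prediction error            */
               U[i, ] = U[i, ] + eta * (e * V[j, ] - lambda * U[i, ]);
               V[j, ] = V[j, ] + eta * (e * U[i, ] - lambda * V[j, ]);
            end;
         end;
      end;
   end;
   Rhat = U * t(V);                         /* filled-in rating matrix            */
   print Rhat;
quit;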

RE METHODS: CLUSTER

k-NN within one subgroup: first cluster the users/items on their profiles and ratings, then make predictions with k-NN inside a cluster.

[Diagram: user/item profile + user/item rating → clustering → k-NN within one subgroup → predictions]

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 49: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAGGING amp BOOSTING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

COMBINE MODELS BAGGING amp BOOSTING

If one model is not good enough let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

Random sample

Final modeldata

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Page 50: Machine learning overview (with SAS software)

COMBINE MODELS: BAGGING & BOOSTING

If one model is not good enough, let multiple models vote for a prediction.

Bootstrap Aggregation (Bagging)

This only makes sense if the underlying models are different enough and have some predictive power.

[Diagram: random samples are drawn from the data, a model is fitted on each sample, and the final model combines their votes]

BAGGING & BOOSTING: RANDOM FORESTS

Random forests ≈ bagging with trees.

Apply the underlying steps repeatedly:
1. Generate a bootstrap sample.
2. Choose randomly m inputs, m << P.
3. Fit a tree on the bootstrap sample with the m inputs (do not prune).

In case of a classification tree: the random forest prediction is the majority vote of all trees.

In case of a regression tree: the random forest prediction is the average of all trees.
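As a concrete sketch, a random forest along these lines can be fit in SAS with PROC HPFOREST; the dataset and variable names below are assumptions for illustration, not from the deck:

proc hpforest data = creditdata
   maxtrees = 100        /* number of bootstrap samples / trees */
   vars_to_try = 3       /* m inputs chosen at random per split, m << P */
   trainfraction = 0.6;  /* fraction of the data drawn for each bootstrap sample */
   target default / level = binary;          /* classification: majority vote */
   input LTV income age / level = interval;
run;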

FOREST VS TREE: EXAMPLE ON SIMULATED DATA

Decision tree and random forest (100 sub-trees) fitted on the simulated data. It is clear to see that the forest produces much smoother predictions.

GRADIENT BOOSTING: DON'T LET THE FORMULAS INTIMIDATE YOU

GRADIENT BOOSTING: SCHEMATIC OVERVIEW

Gradient boosting: M iterations, m = 1, 2, …, M.

At each successive iteration a base learner hm (which is a decision tree) is fit on the pseudo-residuals rm, using inputs x, to "correct" the previous learner:

Fm = Fm-1 + γ·hm

[Diagram: inputs x and pseudo-residuals r1, r2, …, rM feed each stage; the final model is FM]
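In SAS a gradient boosting model of this form can be fit with PROC TREEBOOST; a minimal sketch (dataset and variable names are again assumptions):

proc treeboost data = creditdata
   iterations = 100   /* M boosting iterations */
   shrinkage = 0.1    /* the step size (gamma) applied to each learner */
   maxdepth = 3;      /* size of each base-learner tree hm */
   target default / level = binary;
   input LTV income age / level = interval;
run;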


SUPPORT VECTOR MACHINES

Support vector machines (SVM): suppose we have a separable classification problem.

Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If not separable, you have to allow that some points are on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x2, x3 or spline(x).

The beauty of SVM is that in the calculation of the decision boundary we do not need to explicitly use these transformations: "the kernel trick".

SVM: UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

• Separable classification
• Non-separable classification
• Non-separable classification rewritten using the Lagrange dual problem
• Kernels to model nonlinear behaviour
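The formulas themselves were images on the slide; in the standard textbook form (cf. The Elements of Statistical Learning) they are:

Separable:       \min_{\beta_0,\beta}\ \tfrac{1}{2}\lVert\beta\rVert^2
                 \quad \text{s.t. } y_i(x_i^T\beta + \beta_0) \ge 1

Non-separable:   \min_{\beta_0,\beta,\xi}\ \tfrac{1}{2}\lVert\beta\rVert^2 + C\sum_i \xi_i
                 \quad \text{s.t. } y_i(x_i^T\beta + \beta_0) \ge 1 - \xi_i,\ \xi_i \ge 0

Lagrange dual:   \max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle
                 \quad \text{s.t. } 0 \le \alpha_i \le C,\ \sum_i \alpha_i y_i = 0

Kernels:         replace \langle x_i, x_j\rangle by K(x_i, x_j), e.g. a polynomial or radial basis kernel.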

https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable in the plane, but in 3D space they are.

K – NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

Example: of the 5 nearest neighbours of x0, 3 are red and 2 are green, so we predict x0 to be red.
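A k-NN classifier of exactly this kind can be run in SAS with PROC DISCRIM and its nonparametric method; a minimal sketch with assumed dataset and variable names:

proc discrim data = train test = newpoints testout = predictions
             method = npar k = 5;   /* majority vote of the 5 nearest neighbours */
   class colour;   /* the group label, e.g. red / green */
   var x1 x2;      /* coordinates used for the distances */
run;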

K-NN METHOD

[Figures: k-NN decision boundaries for 1 nearest neighbour and 15 nearest neighbours]

K-NN METHOD

Use different numbers k of nearest neighbours: test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns

K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.


Comparing different nearest neighbours in SAS Enterprise Miner

K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were tried; k = 5 nearest neighbours has the lowest average squared error.

NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4

[Diagram: inputs 1, X2, X3, X4 with weights w1, w2, w3, w4 feeding a neural network compute node]

f is the so-called activation function. This could be the logit function, but other choices are possible.

There are four weights w's that have to be determined.

NEURAL NETWORKS: MATHEMATICAL FORMULATION

In formula form, the prediction formula for a NN is given below.

[Diagram: inputs X1 (age), X2 (income), X3 (region), X4 (gender) feed hidden-layer nodes Z1, Z2, Z3 via weights α; the hidden layer feeds the output Y via weights β]

The functions g and σ are defined as shown below; in case of a binary classifier, g is the logistic function.

The model weights α and β have to be estimated from the data.
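The formulas were images on the slide; in the standard single-hidden-layer notation (as in The Elements of Statistical Learning) they read:

Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \qquad m = 1,\dots,M

f(X) = g(\beta_0 + \beta^T Z), \qquad \sigma(v) = \frac{1}{1 + e^{-v}}

For a binary classifier g can itself be taken to be the logistic function, so that the output is a probability.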

NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back-propagation algorithm:

Randomly choose small values for all wi's. For each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual – prediction)²).
3. Adjust the weights w according to the update rule below.
4. Stop if the error E is small enough.
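The adjustment formula in step 3 was an image on the slide; the usual gradient-descent form, assumed here, is, with learning rate η:

w_i \leftarrow w_i - \eta\,\frac{\partial E}{\partial w_i}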

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: inputs X1–X4 are ENCODED into a smaller middle layer and DECODED back to X1–X4]

A linear activation function corresponds with 2-dimensional principal components analysis. A 2-dimensional middle layer can be used for visualisation.

NEURAL NETS: AUTOENCODERS

Often more hidden layers with many nodes.

[Diagram: INPUT → ENCODE → DECODE → OUTPUT = INPUT]

NEURAL NET: CARS EXAMPLE

[Figures: 2-dimensional PCA vs. an autoencoder network 25 – 15 – 2 – 15 – 25]

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data = autoencoderTraining
   dmdbcat = work.autoencoderTrainingCat;
   performance compile details cpucount = 12 threads = yes;
   /* DEFAULTS: ACT = TANH, COMBINE = LINEAR            */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6   */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED         */
   archi MLP hidden = 5;
   hidden 300 / id = h1;
   hidden 100 / id = h2;
   hidden 2   / id = h3 act = linear;
   hidden 100 / id = h4;
   hidden 300 / id = h5;
   input corruptedPixel1 - corruptedPixel400 / id = i level = int std = std;
   target pixel1 - pixel400 / act = identity id = t level = int std = std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random = 123;
   prelim 10 preiter = 10;
run;

Two-dimensional representation of the 400-dimensional 'digit' data.


BAYESIAN NETWORKS

BAYESIAN NETWORKS – ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from training data for each node.
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.


TEXT MINING

TEXT MINING BASICS

"Advanced" word counting:

Parse & filter: part of speech, entity detection, mixed/numeric/abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Apply traditional data mining: clustering, prediction, machine learning.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $#$%"

Terms                      Doc 1   Doc 2   Doc 3
+Fiets (znmw)                1       1       1
Fietsen (ww)                 0       1       0
Blauwe (bvg)                 0       1       0
Amsterdam (locatie)          1       0       0
+Lopen (ww)                  1       1       0
Straat (znmw)                1       0       0
Kapot (bijw)                 0       0       1
Slecht                       0       0       1
Stuk Ijzer                   0       0       1
1057DK (postcode)            1       0       0
bitlycomsdrtw (Internet)     0       1       0

TERM DOCUMENT MATRIX A:
• Each text document is a (very) long vector of word counts (often with many zeros).
• Apply further mining on this matrix A.

TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:
• often more terms than documents
• rows could be strongly correlated
• the matrix is often very sparse

Apply singular value decomposition first.

TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.

Matrix SVD decomposition:  A = U Σ Vᵀ, where Σ is diagonal with r singular values [could be many thousands].

Take only the first k << r singular values:  A ≈ Ak = Uk Σk Vkᵀ.

TEXT MINING APPLICATIONS

Combine structured customer data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (Topic 1, Topic 2, Topic 3, …).

RECOMMENDATION ENGINE: Which product should I recommend to my customers?

RECOMMENDATION ENGINE: USER–ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items → ~0.01% filled.

User–Item Matrix – Data
           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings:            -      -      1      2      5
After some math… predictions:   3.21   4.82    1      2      5

Recommend item 2.

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
Model-based algorithms: matrix factorization (SVD – LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble

RE METHODS: SLOPE ONE

y = x + b: a regression with slope equal to 1.

Item–item based. Weight wij: the number of users having rated both items i and j. Rating ruj: the average rating computed from item j.

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
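A worked sketch of the slope-one prediction on this sample, predicting Lucy's rating for item A: the average difference A − B is ((5−3) + (3−4)) / 2 = 0.5 with weight 2 (John and Mark both rated A and B), and A − C is 5 − 2 = 3 with weight 1 (John only). Lucy's predicted rating for A is then the weighted average (2·(2 + 0.5) + 1·(5 + 3)) / (2 + 1) = 13/3 ≈ 4.33.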

RE METHODS: K NEAREST NEIGHBORS

The rating rui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Diagram: similarity w between users, neighbourhood N]

RE METHODS: PEARSON CORRELATION

a, b: users; r(a,p): rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between −1 and +1.

sim(a,b) = Σp∈P (r(a,p) − r̄a)(r(b,p) − r̄b) / ( √( Σp∈P (r(a,p) − r̄a)² ) · √( Σp∈P (r(b,p) − r̄b)² ) )

RE METHODS: K NEAREST NEIGHBORS METHOD

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data? Factor the m × n user–item matrix R as R ≈ U·V, with U an m × k and V a k × n matrix (k hidden factors).

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating:  R̂ij = Uiᵀ Vj

Minimize the prediction error:  min(U,V) Σij (Rij − Uiᵀ Vj)² + λ(‖Ui‖² + ‖Vj‖²)

RE METHODS: CLUSTER

Knn within one subgroup.

[Diagram: user/item profiles and user/item ratings → clustering → knn within one subgroup → predictions]

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data:

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X,Y) = (# trxs with X and Y) / (total # trxs)
Lift = Support(X,Y) / ( Support(X) · Support(Y) )

Support & lift examples: Diapers → Beer: 0.8; Diapers → Candles: 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
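A quick worked calculation with made-up numbers: if 2% of all transactions contain both diapers and beer, 5% contain diapers, and 10% contain beer, then Support(X,Y) = 0.02 and Lift = 0.02 / (0.05 · 0.10) = 4, i.e. diaper buyers are 4 times more likely than average to also buy beer.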

RE METHOD: ENSEMBLE

Linear combination of the previous methods, to achieve better performance.

PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = svd
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD ARM / label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = svd
      Num = 3
      users = ("Longhow Lam");
run;
QUIT;

LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5
R² neural net = 0.6

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split; models tried:
• PCA regression on the 50 largest PC's
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here. Red numbers are predicted labels; we see some obvious mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Embedded audio players for the recorded digits '1' and '2']

SPEECH RECOGNITION

WAV files consist of ~ 30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
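A sketch of these two reduction steps in SAS (dataset and variable names are assumptions; PROC SPECTRA writes the periodogram as rows, so the output would need to be transposed so that each recording becomes one row of frequency features before PROC PRINCOMP):

/* periodogram of one recorded digit */
proc spectra data = wav_signal out = freqdom p;
   var amplitude;
run;

/* after transposing: principal components of the spectra */
proc princomp data = all_spectra out = pc_scores n = 8;
   var f1 - f100;
run;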

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• retrieve the SAS faces from our site,
• put them through the Face++ API,
• collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical cluster)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

[Figures: SAS faces vs. actors' faces]

Read more on my blog.

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 51: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Bagging amp Boosting Random Forests

Random forests asymp Bagging with trees

Apply underlying steps repeatedly1 Generate a bootstrap sample2 Choose randomly m inputs m ltlt P 3 Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree The random forest prediction is the majority vote of all trees

In case of a regression tree The random forest prediction is the average of all trees

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 52: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100 sub trees) fitted on the simulated data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING DONrsquoT LET THE FORMULAS INTIMIDATE YOU

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient Boosting M iterations m = 12hellipM

Inputs x

r1

Final model FMhellip M

At each succesive iteration a base learner hm (which is a decision tree) is fit on the pseudo residuals using inputs x to ldquocorrectrdquo the previous learner

Pseudo residuals rim at each step

r2 rMInputs

xInputs

x

Fm = Fm-1 + γmiddothm

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Comparing different nearest neighbours in SAS Enterprise Miner.

K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were tried; k=5 nearest neighbours has the lowest average squared error.

NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = w1·1 + w2·X2 + w3·X3 + w4·X4

[Diagram: bias input 1 and inputs X2, X3, X4, with weights w1…w4, feeding one compute node f]

Neural network compute node: f is the so-called activation function. This could be the logit function, but other choices are possible.

There are four weights w's that have to be determined.

NEURAL NETWORKS: MATHEMATICAL FORMULATION

In formula form the prediction formula for a NN is given by

Y = g( β0 + Σm βm·Zm ),  with hidden units  Zm = σ( α0m + αmᵀX )

[Diagram: inputs X1 (age), X2 (income), X3 (region), X4 (gender) → hidden layer Z1, Z2, Z3 → output Y, with weights α on the first layer and β on the second]

The function σ is the activation function, for example the sigmoid σ(v) = 1 / (1 + e^(−v)); in case of a binary classifier, g is the logistic function.

The model weights α and β have to be estimated from the data.
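As a sketch of these formulas (illustrative numpy; random weights stand in for the estimated ones):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def nn_predict(X, alpha0, alpha, beta0, beta):
    Z = sigmoid(alpha0 + X @ alpha)     # hidden layer: Zm = sigma(alpha0m + alpham'X)
    return sigmoid(beta0 + Z @ beta)    # binary classifier: g is the logistic function

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))             # 5 observations of the inputs X1..X4
alpha0, alpha = rng.normal(size=3), rng.normal(size=(4, 3))
beta0, beta = rng.normal(), rng.normal(size=3)
print(nn_predict(X, alpha0, alpha, beta0, beta))   # 5 predicted probabilities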

NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back-propagation algorithm:

Randomly choose small values for all wi's. Then, for each data point (observation):

1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w in the direction that decreases E (a gradient step, w ← w − η·∂E/∂w).
4. Stop if the error E is small enough.
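A toy version of this loop for a single compute node (a sketch, not the deck's code; η is the learning rate):

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = rng.normal(scale=0.1, size=3)            # 1. small random start values

eta = 0.5
for _ in range(200):
    for xi, yi in zip(X, y):                 # one observation at a time
        pred = sigmoid(xi @ w)               # 2. prediction
        err = pred - yi                      # 3. error
        grad = err * pred * (1 - pred) * xi  #    dE/dw for E = (pred - y)^2 / 2
        w -= eta * grad                      # 4. adjust weights
print(np.mean((sigmoid(X @ w) > 0.5) == (y == 1)))   # training accuracy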

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: X1…X4 → ENCODE → 2-dimensional middle layer (for visualisation) → DECODE → X1…X4]

A linear activation function corresponds with 2-dimensional principal components analysis.

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes:

INPUT → ENCODE → DECODE → OUTPUT = INPUT

NEURAL NET: CARS EXAMPLE

[Figure: 2-dimensional PCA vs. an autoencoder network 25 – 15 – 2 – 15 – 25]

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS – SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

Two-dimensional representation of the 400-dimensional 'digit' data.

BAYESIAN NETWORKS


BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from training data for each node.
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.

TEXT MINING

TEXT MINING BASICS

"Advanced" word counting.

Parse & filter: part of speech, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Apply traditional data mining: clustering, prediction, machine learning.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets."
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"

TERM DOCUMENT MATRIX A

Terms                       Doc 1   Doc 2   Doc 3
+Fiets (znmw)                 1       1       1
Fietsen (ww)                  0       1       0
Blauwe (bvg)                  0       1       0
Amsterdam (locatie)           1       0       0
+Lopen (ww)                   1       1       0
Straat (znmw)                 1       0       0
Kapot (bijw)                  0       0       1
Slecht                        0       0       1
Stuk Ijzer                    0       0       1
1057DK (postcode)             1       0       0
bitlycomsdrtw (Internet)      0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros).
• Apply further mining on this matrix A.

TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix:

• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply a singular value decomposition first.

TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U·Σ·Vᵀ, with Σ diagonal with r singular values [could be many thousands].

Take only the first k << r singular values: Ak = Uk·Σk·Vkᵀ.

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
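In numpy this truncation is a few lines (a sketch on a toy matrix):

import numpy as np

# toy term-document matrix A: rows = terms, columns = documents
A = np.array([[1, 1, 1],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)     # A = U @ diag(s) @ Vt

k = 2                                        # keep only the k << r largest singular values
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of A
docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T      # every document as a short k-vector
print(np.round(docs_k, 2))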

TEXT MINING: APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[Figure: documents grouped into Topic 1, Topic 2, Topic 3]

RECOMMENDATION ENGINE: Which product should I recommend to my customers?

RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives roughly 0.01% filled.

User - Item Matrix - Data

           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings:  -, -, 1, 2, 5
After some math… the predictions for User 4 are:  3.21, 4.82, 1, 2, 5

Recommend item 2.

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
Model-based algorithms: matrix factorization (SVD - LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble

RE METHODS: SLOPE ONE

Item-item based: y = x + b, with slope equal to 1.

Weight wij: the number of users having rated both items i and j.
Rating ruj: the average rating computed from item j.

Sample rating database:

Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4        -
Lucy          -        2        5
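A sketch of the slope-one prediction on exactly this table (numpy; np.nan marks a missing rating):

import numpy as np

#                Item A   Item B   Item C
R = np.array([[5.0,     3.0,     2.0],      # John
              [3.0,     4.0,     np.nan],   # Mark
              [np.nan,  2.0,     5.0]])     # Lucy

def slope_one(R, u, j):
    # predict rating of user u for item j from average item-item differences
    num = den = 0.0
    for i in range(R.shape[1]):
        if i == j or np.isnan(R[u, i]):
            continue
        both = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])
        w = both.sum()                           # users that rated both i and j
        if w == 0:
            continue
        dev = (R[both, j] - R[both, i]).mean()   # average difference j - i
        num += (R[u, i] + dev) * w
        den += w
    return num / den

print(round(slope_one(R, u=2, j=0), 2))   # Lucy's predicted rating for Item A -> 4.33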

RE METHODS: K NEAREST NEIGHBORS

The rating rui is determined by the ratings "in the neighborhood": a weighted average, with similarity weights w, over the neighbors N.

How to determine the neighbors, and how many (k) to use?

How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

RE METHODS: PEARSON CORRELATION

a, b : users
r(a,p) : rating of user a for item p
P : set of items rated both by a and b
Possible similarity values between −1 and 1.

sim(a, b) = Σ(p∈P) (r(a,p) − r̄a)·(r(b,p) − r̄b) / ( √[Σ(p∈P) (r(a,p) − r̄a)²] · √[Σ(p∈P) (r(b,p) − r̄b)²] )
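The same formula in a short numpy function (illustration, with np.nan for items a user did not rate):

import numpy as np

def pearson_sim(ra, rb):
    # P = items rated by both users; means are taken over P
    both = ~np.isnan(ra) & ~np.isnan(rb)
    if both.sum() < 2:
        return 0.0
    da = ra[both] - ra[both].mean()
    db = rb[both] - rb[both].mean()
    denom = np.sqrt((da ** 2).sum()) * np.sqrt((db ** 2).sum())
    return (da * db).sum() / denom if denom > 0 else 0.0

a = np.array([3, 2, 5, 4, np.nan])
b = np.array([1, np.nan, 2, 5, 4])
print(round(pearson_sim(a, b), 3))   # a value between -1 and 1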

RE METHODS: K NEAREST NEIGHBORS METHOD

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

Factor the m × n user-item matrix R as R ≈ U·V, with U an m × k and V a k × n matrix:

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating:  R̂ij = Uiᵀ·Vj

Minimize the prediction error:  min(U,V) Σ(i,j) (Rij − Uiᵀ·Vj)² + λ·(‖Ui‖² + ‖Vj‖²)
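A small sketch of this objective, minimized here with plain stochastic gradient steps instead of L-BFGS/ALS (toy data; 0 marks a missing rating):

import numpy as np

rng = np.random.default_rng(3)
R = np.array([[3, 2, 5, 4, 5],
              [0, 0, 0, 1, 1],
              [1, 0, 2, 5, 0],
              [0, 0, 1, 2, 5.0]])
mask = R > 0                              # observed ratings only

k, lam, lr = 2, 0.1, 0.01                 # hidden factors, penalty, step size
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))

# minimize sum over observed (i,j) of (Rij - Ui'Vj)^2 + lam*(|Ui|^2 + |Vj|^2)
for _ in range(2000):
    for i, j in zip(*np.nonzero(mask)):
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * (err * V[j] - lam * U[i])   # gradient step in U
        V[j] += lr * (err * U[i] - lam * V[j])   # gradient step in V
print(np.round(U @ V.T, 2))               # filled-in rating matrix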

RE METHODS: CLUSTER

First cluster the users/items on their profiles and ratings; then apply knn within one subgroup to produce the predictions.

[Diagram: user/item profile + user/item rating → clustering → knn within one subgroup → predictions]

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X, Y) = (# trxs with X and Y) / (total # trxs)
Lift = Support(X, Y) / ( Support(X) · Support(Y) )

Support & lift examples:  Diapers → Beer: 0.8;  Diapers → Candles: 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
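Support and lift are simple counts; a sketch on toy transactions:

# toy transaction data: support and lift for the rule X -> Y
baskets = [{"diapers", "beer"}, {"diapers", "beer", "milk"},
           {"diapers", "candles"}, {"beer"}, {"milk", "candles"}]

def support(items):
    # fraction of transactions containing all of `items`
    return sum(items <= b for b in baskets) / len(baskets)

def lift(x, y):
    return support(x | y) / (support(x) * support(y))

print(support({"diapers", "beer"}))           # 0.4
print(round(lift({"diapers"}, {"beer"}), 2))  # > 1 means X and Y go together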

METHOD: ENSEMBLE

A linear combination of the previous methods, to achieve better performance.

PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;
   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);
   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;
   METHOD arm /
      label = "ARM";
   RUN;
   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      Num = 3
      users = ("Longhow Lam");
run;
QUIT;

LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar with a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High-performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:

R² linear regression = 0.5
R² neural net = 0.6

IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

[Figure: the first 100 digits of the MNIST data with their KNOWN labels in red]

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split; models tried:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA: APPLY THE MODEL ON THE TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are predicted labels; we see some obvious mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio players: recordings of the spoken digits 1 and 2]

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
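The preprocessing chain could look like this (a sketch; the file names are hypothetical and SciPy is assumed for reading the wav files):

import numpy as np
from scipy.io import wavfile             # assumes SciPy is available

def spectrum_features(path, n_bins=512):
    # read a wav file and return a fixed-length magnitude spectrum
    rate, signal = wavfile.read(path)    # ~30,000 samples per recording
    spec = np.abs(np.fft.rfft(signal))   # to the frequency domain
    return spec[:n_bins] / spec.max()    # crude normalisation

# stack the 16 training recordings, then reduce further with PCA (via SVD)
files = [f"one_{i}.wav" for i in range(8)] + [f"two_{i}.wav" for i in range(8)]
X = np.vstack([spectrum_features(f) for f in files])
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                   # first 2 principal component scores
print(scores.shape)                      # (16, 2): the inputs for the classifier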


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical cluster)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

[Figure: SAS faces vs. actors' faces]

Read more on my blog.


  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 55: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient boosting: M iterations, m = 1, 2, …, M.

At each successive iteration a base learner hm (which is a decision tree) is fit on the pseudo-residuals rm, using the inputs x, to "correct" the previous learner. The pseudo-residuals r1, r2, …, rM are recomputed at each step, and the model is updated as

Fm = Fm-1 + γm·hm

The final model after M iterations is FM.
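To make the loop concrete, a minimal illustrative sketch in Python (not the SAS implementation): with squared-error loss the pseudo-residuals are simply y minus the current prediction, and a fixed learning rate gamma stands in for the line search.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

M, gamma = 50, 0.1                   # boosting iterations, fixed learning rate
F = np.full_like(y, y.mean())        # F0: start from the mean
learners = []
for m in range(M):
    r = y - F                                          # pseudo-residuals for squared-error loss
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)   # base learner hm fit on the residuals
    F = F + gamma * h.predict(X)                       # Fm = Fm-1 + gamma * hm
    learners.append(h)                                 # keep hm to score new data later

print("training MSE:", np.mean((y - F) ** 2))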

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM). Suppose we have a separable classification problem.

Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If the problem is not separable, you have to allow that some points are on the wrong side. These points are penalized; SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The input space might not be linear. We could apply non-linear mappings to the inputs, e.g. x², x³ or spline(x).

The beauty of SVM is that in the calculation of the decision boundary we do not need to use these transformations explicitly: "the kernel trick".


Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

Separable classification

Non-separable classification

Non-separable classification rewritten using the Lagrange dual problem

Kernels to model non-linear behaviour
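The formulas themselves were figures on the slide; in standard textbook form (a reconstruction, not copied from the deck) they read:

\min_{w,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{s.t. } y_i (w^\top x_i + b) \ge 1 \qquad \text{(separable)}

\min_{w,b,\xi}\ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{s.t. } y_i (w^\top x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0 \qquad \text{(non-separable)}

\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t. } 0 \le \alpha_i \le C,\ \sum_i \alpha_i y_i = 0 \qquad \text{(Lagrange dual)}

The kernel trick: replace every inner product x_i^\top x_j by a kernel K(x_i, x_j).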

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

https://www.youtube.com/watch?v=3liCbRZPrZA

Linearly not separable, but in 3D space they are.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K – NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

The 5 nearest neighbours of x0: 3 of them are red, 2 of them are green, so we predict x0 to be red.
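A compact illustrative sketch of this vote in Python (a toy stand-in, not the Enterprise Miner node):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k=5):
    """Classify x0 by majority vote among its k nearest training points."""
    dist = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to x0
    nearest = np.argsort(dist)[:k]                # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [0.2, 0.1]])
y_train = np.array(["red", "red", "green", "green", "red"])
print(knn_predict(X_train, y_train, np.array([0.15, 0.15]), k=3))   # -> "red"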

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour vs. 15 nearest neighbours

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours: test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were used; k=5 nearest neighbours has the lowest average squared error.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS / DEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

Y = f(X, w) = f(w1·1 + w2·X2 + w3·X3 + w4·X4)

The inputs 1 (the bias), X2, X3 and X4 enter a single compute node with weights w1, w2, w3 and w4. f is the so-called activation function; this could be the logit function, but other choices are possible.

There are four weights w's that have to be determined.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula, the prediction of a NN is given by the network: the inputs X1 (Leeftijd / age), X2 (Inkomen / income), X3 (Regio / region) and X4 (Geslacht / gender) feed a hidden layer Z1, Z2, Z3, which in turn feeds the output Y. α1, … denote the input-to-hidden weights and β1, … the hidden-to-output weights.

The functions g and σ are defined below; in the case of a binary classifier, g maps the output to a probability. The model weights α and β have to be estimated from the data.
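The slide showed these definitions as images; the standard single-hidden-layer form (a reconstruction in the deck's notation) is:

Z_m = \sigma(\alpha_{0m} + \alpha_m^\top X), \quad m = 1, \dots, M

Y = g(\beta_0 + \beta^\top Z), \qquad \sigma(v) = \frac{1}{1 + e^{-v}}

For a binary classifier, g is taken to be the logistic (sigmoid) function as well, so that Y \in (0, 1).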

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back-propagation algorithm:

Randomly choose small values for all weights wi. For each data point (observation):

1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w in the direction that reduces the error (gradient descent): w ← w − η·∂E/∂w.
4. Stop if the error E is small enough.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING: A NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use inputs to predict the inputs

Inputs X1, X2, X3, X4 → ENCODE → middle layer → DECODE → outputs X1, X2, X3, X4

A linear activation function corresponds with 2-dimensional principal components analysis.

A 2-dimensional middle layer, for visualisation.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2-dimensional PCA vs. autoencoder network 25 – 15 – 2 – 15 – 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR            */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED       */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two-dimensional representation of the 400-dimensional 'digit' data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables.
• Links between nodes represent conditional dependencies.
• Conditional probability tables are derived from training data for each node.
• Random variables are typically binary or discrete.
• The graph structure can be learned from the data.
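As a toy illustration of how the conditional probability tables combine along the graph (plain Python, made-up numbers, for a hypothetical two-node network Rain → WetGrass):

# P(Rain) and P(WetGrass | Rain) as conditional probability tables
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {True:  {True: 0.9, False: 0.1},
                    False: {True: 0.2, False: 0.8}}

# The joint factorizes along the graph: P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain)
p_wet = sum(p_rain[r] * p_wet_given_rain[r][True] for r in (True, False))
print("P(WetGrass) =", p_wet)   # 0.2*0.9 + 0.8*0.2 = 0.34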

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

"Advanced" word counting

Parse & Filter: part of speech, entity detection, mixed / numeric / abbreviations, stemming, spell checks, stop list, synonym list, multi-term words

Apply: traditional data mining, clustering, prediction, machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"
(Dutch example sentences about walking and cycling, including a postcode, a URL and a misspelling.)

Terms                       Doc 1   Doc 2   Doc 3
+Fiets (znmw)                 1       1       1
Fietsen (ww)                  0       1       0
Blauwe (bvg)                  0       1       0
Amsterdam (locatie)           1       0       0
+Lopen (ww)                   1       1       0
Straat (znmw)                 1       0       0
Kapot (bijw)                  0       0       1
Slecht                        0       0       1
Stuk Ijzer                    0       0       1
1057DK (postcode)             1       0       0
bitlycomsdrtw (Internet)      0       1       0

TERM DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros).
• Apply further mining on this matrix A.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is then not a long vector of m word counts, but a much shorter vector, say of length 300.

Matrix SVD decomposition: A = U Σ Vᵀ, where Σ is diagonal with r singular values [r could be many thousands].

Take only the first k << r singular values: A ≈ Ak = Uk Σk Vkᵀ.
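The truncation in illustrative Python (numpy only; the matrix is a slice of the toy 3-document table above):

import numpy as np

# Toy term-document matrix A (terms x documents): first rows of the table above
A = np.array([[1, 1, 1],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                           # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation Ak
docs_k = np.diag(s[:k]) @ Vt[:k, :]             # each column: a document as a k-dim vector
print(docs_k.T)                                 # documents in the reduced SVD space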

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE: Which product should I recommend to my customers?

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users, 100K items → ~ 0.01% filled.

User - Item Matrix – Data
           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings: -, -, 1, 2, 5

After some math… the predictions for User 4 are: 3.21, 4.82, 1, 2, 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: Slope one (slope1), K nearest neighbors (knn)

Model-based algorithms: Matrix factorization (SVD - LBFGS)

Market basket analysis: Association rules mining (arm)

Mixture of different methods: Clustering (cluster), Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Item-item based: predictors of the form y = x + b, with slope equal to 1.

Weight wij: the number of users having rated both items i and j. Rating ruj: the average rating computed from item j.

Sample rating database:
Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
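Illustrative Python for one slope-one prediction under the table above (the placement of Mark's and Lucy's missing cells follows the classic textbook layout, assumed here; this is not the PROC RECOMMEND internals):

# Average deviation of Item A over Item B among users who rated both (John, Mark)
dev_ab = ((5 - 3) + (3 - 4)) / 2     # = 0.5
# Average deviation of Item A over Item C (only John rated both)
dev_ac = (5 - 2) / 1                 # = 3.0

# Slope one: r = x + b. Predict Lucy's Item A rating from her known ratings,
# weighting each deviation by the number of users it is based on.
pred = ((2 + dev_ab) * 2 + (5 + dev_ac) * 1) / (2 + 1)
print(round(pred, 2))                # 4.33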

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

Similarity w, neighbors N.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS: PEARSON CORRELATION

a, b: users. r(a,p): the rating of user a for item p. P: the set of items rated by both a and b. Possible similarity values between −1 and 1.

sim(a, b) = Σp∈P (ra,p − r̄a)(rb,p − r̄b) / ( √[ Σp∈P (ra,p − r̄a)² ] · √[ Σp∈P (rb,p − r̄b)² ] )
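The same formula in illustrative Python (here the means r̄ are taken over the common items P, which is one common convention):

import math

def pearson_sim(ra, rb):
    """Pearson similarity between two users' rating dicts (item -> rating)."""
    P = ra.keys() & rb.keys()               # items rated by both users
    if not P:
        return 0.0
    ma = sum(ra[p] for p in P) / len(P)     # mean rating of user a over P
    mb = sum(rb[p] for p in P) / len(P)
    num = sum((ra[p] - ma) * (rb[p] - mb) for p in P)
    den = (math.sqrt(sum((ra[p] - ma) ** 2 for p in P)) *
           math.sqrt(sum((rb[p] - mb) ** 2 for p in P)))
    return num / den if den else 0.0

print(pearson_sim({"A": 5, "B": 3, "C": 2}, {"A": 3, "B": 4, "C": 2}))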

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

Factorize the m × n user-item matrix R (users in the rows, items in the columns) into two smaller matrices: R (m × n) ≈ U (m × k) · V (k × n), with k hidden factors.

Select a loss function (squared error). Select the number of hidden factors k. Optimization problem: L-BFGS, ALS.

Predict new rating: R̂ij = Uiᵀ Vj

Minimize the prediction error:

min over U, V of Σi,j (Rij − Uiᵀ Vj)² + λ( ‖Ui‖² + ‖Vj‖² )
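A bare-bones sketch of this objective with plain gradient descent on a toy matrix (illustrative only; PROC RECOMMEND optimizes it with L-BFGS or ALS):

import numpy as np

R = np.array([[3, 2, 5],
              [1, 0, 2],
              [2, 1, 4.0]])             # toy ratings; 0 marks a missing entry
mask = R > 0                            # fit the observed entries only
k, lam, lr = 2, 0.1, 0.02               # hidden factors, regularization, step size

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))
for _ in range(2000):
    E = mask * (R - U @ V.T)            # errors on observed ratings
    U += lr * (E @ V - lam * U)         # gradient step on the user factors
    V += lr * (E.T @ U - lam * V)       # gradient step on the item factors

print(np.round(U @ V.T, 2))             # R-hat: filled-in rating matrix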

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

First cluster the users/items on their profiles and ratings; then apply kNN within one subgroup (cluster) to make the predictions.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data.

IF item A and B THEN item C. IF item X THEN item Y.

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule.

Support(X → Y) = (# transactions with X and Y) / (total # transactions)

Lift(X → Y) = Support(X, Y) / ( Support(X) · Support(Y) )

Support & lift examples: Diapers → Beer 0.8; Diapers → Candles 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
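Counting support and lift over a toy transaction list (illustrative Python, made-up baskets):

transactions = [{"diapers", "beer"}, {"diapers", "beer", "milk"},
                {"milk"}, {"diapers"}, {"beer"}]

def support(*items):
    """Fraction of transactions containing all given items."""
    s = set(items)
    return sum(s <= t for t in transactions) / len(transactions)

lift = support("diapers", "beer") / (support("diapers") * support("beer"))
print(support("diapers", "beer"), lift)   # 0.4 and ~1.11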

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

A linear combination of the previous methods, to achieve better performance.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD, L-BFGS with 20 factors */
   METHOD svd /
      factors   = 20
      label     = "svd"
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      maxfeval  = 5000
      function  = L2
      lamda     = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label  = "svd"
      num    = 3
      users  = ("Longhow Lam");
run;
QUIT;

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
Unfamiliar to a broader audience, (more) difficult to explain.
Black-box approach (you are rejected: the computer says NO).
Often the relations can already be modeled with classical regression models.
It allows you to not think about the business problem.

PROS:
Often less data prep (manual tuning) necessary (just throw it in the algorithm…).
Interactions are often "automatically" taken into account.
Superior for text mining, image & speech recognition.
Better lift possible (a few percent "for free").
It allows you to not think about the business problem.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used Text Miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs. given review score

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

R² linear regression = 0.5; R² neural net = 0.6


Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

The first 100 digits of the MNIST data and their KNOWN labels in red.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split.

PCA regression on the 50 largest PCs.

Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons.

Seven multi-layer neural nets.

Three random forests: 100, 500 and 1000 trees.

8, 16 and 24 nearest neighbours.

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels.

Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels. We see some obvious mistakes…

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Embedded audio players: recordings of the spoken digits 1 and 2.]

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consist of ~ 30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain.

Still too much: apply principal components.

TRAIN DATA

8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.
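That pipeline, spectrum first and then principal components, in illustrative Python (synthetic arrays standing in for the WAV files; numpy only):

import numpy as np

rng = np.random.default_rng(2)
# 16 synthetic 'recordings' standing in for the 8 'ones' and 8 'twos'
signals = rng.normal(size=(16, 30000))

spectra = np.abs(np.fft.rfft(signals, axis=1))   # spectral analysis -> frequency domain
spectra -= spectra.mean(axis=0)                  # center before PCA
U, s, Vt = np.linalg.svd(spectra, full_matrices=False)   # PCA via SVD
scores = spectra @ Vt[:5].T                      # keep the first 5 PCs as features
print(scores.shape)                              # (16, 5)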

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to: retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:

Which faces are look-alikes? → proc cluster (hierarchical clustering)
Sales faces? → predictive modeling / machine learning
Who is the Brad Pitt? → nearest neighbour
Strange faces? → proc neural auto-encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: LOOK-ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 56: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SUPPORT VECTOR MACHINES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES


STRANGE FACE DETECTION: STRANGE FACES

SAS faces vs. actors' faces

Read more on my blog.



Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 58: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 59: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Support vector machines (SVM) Suppose we have a separable classification problem

Find a linear decision boundary between the two groups with maxium margin M So green line would be better than blue line

If not separable you have to allow that some points are on the wrong side These points are penalized SVM still maximizes the margin M but with the constraint that total penalty is smaller than C

The input space might not be linear We could apply non linear mappings to the inputs Ie x2 x3 of spline(x)

The beauty of SVM is that in the calculations of the decision boundary we do not need to explicitly use these transformations ldquoThe kernel trickrdquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMSSeparable classification

Non Separable classification

Non Separable classification rewritten using Lagrange Dual problem

Kernels to model nonlinear behaviour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

"Advanced" word counting

Parse & Filter: part of speech, entity detection, mixed / numeric / abbrev., stemming, spell checks, stop list, synonym list, multi-term words

Apply traditional data mining: clustering, prediction, machine learning


TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets" ("I walk down the street in Amsterdam 1057DK with my bike")
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw" ("She did not walk but cycled with her blue bike [sic], bitlycomsdrtw")
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$" ("My two-wheeler is broken, what a bad piece of iron $$")

Terms                      Doc 1   Doc 2   Doc 3
+Fiets (znmw)                1       1       1
Fietsen (ww)                 0       1       0
Blauwe (bvg)                 0       1       0
Amsterdam (locatie)          1       0       0
+Lopen (ww)                  1       1       0
Straat (znmw)                1       0       0
Kapot (bijw)                 0       0       1
Slecht                       0       0       1
Stuk Ijzer                   0       0       1
1057DK (postcode)            1       0       0
bitlycomsdrtw (Internet)     0       1       0

TERM DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A


TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first


TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.

Matrix SVD decomposition:

$$ A = U \, \Sigma \, V^{\top} $$

with Σ a diagonal matrix holding the r singular values [ could be many thousands ].

Take only the first k << r singular values:

$$ A_k = U_k \, \Sigma_k \, V_k^{\top} $$
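A new document can then be folded into this reduced space; a sketch of the standard step, where d is the raw term-count vector of the document:

$$ \hat{d} = \Sigma_k^{-1} U_k^{\top} d $$

giving the k-dimensional representation (say k = 300) that is used as input for the further mining.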


TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[Diagram: documents grouped under Topic 1, Topic 2, Topic 3]


RECOMMENDATION ENGINE: Which product should I recommend to my customers?


RECOMMENDATION ENGINE: USER – ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: with 1 mln users and 100K items, often only ~ 0.01% of the cells are filled.

User - Item Matrix – Data

           Item 1   Item 2   Item 3   Item 4   Item 5
User 1        3        2        5        4        5
User 2        -        -        -        1        1
User 3        1        -        2        5        -
User 4        -        -        1        2        5
User 5        2        1        4        2        3
User 6        2        3        -        5        1
User 7        5        1        -        3        4
User 8        -        1        -        4        1
User 9        2        3        2        4        2
User 10       -        1        3        -        1

User 4's item ratings:             -      -      1      2      5
After some math… the predictions:  3.21   4.82   1      2      5

Recommend item 2!


RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: Slope one (slope1), K nearest neighbors (knn)

Model-based algorithms: Matrix factorization (SVD - LBFGS)

Market basket analysis: Association rules mining (arm)

Mixture of different methods: Clustering (cluster), Ensemble


RE METHODS SLOPE ONE

y = x + b, with the slope equal to 1.

Item–item based:
• Weight w_ij : the number of users having rated both items i and j
• Rating r_uj : the average rating computed from item j

Sample rating database:

Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4        ?
Lucy          ?        2        5
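A worked sketch of the weighted slope-one prediction for Lucy's missing Item A rating, using the table above: the average deviations of A versus B and C are

$$ dev_{A,B} = \frac{(5-3)+(3-4)}{2} = 0.5, \qquad dev_{A,C} = \frac{5-2}{1} = 3 $$

and with weights w_{A,B} = 2 and w_{A,C} = 1 (the numbers of users rating both items),

$$ \hat{r}_{Lucy,A} = \frac{(2+0.5)\cdot 2 + (5+3)\cdot 1}{2+1} = \frac{13}{3} \approx 4.33 $$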


RE METHODS K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity / distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Diagram: similarity w links user u to the neighbors N]
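A sketch of the resulting user-based prediction: combine the neighbors N(u) of user u through the similarity weights w,

$$ \hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N(u)} w(u,v)\,(r_{vi} - \bar{r}_v)}{\sum_{v \in N(u)} \lvert w(u,v) \rvert} $$

i.e. a similarity-weighted average of how the k neighbors rated item i, relative to their own mean ratings.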


RE METHODS: PEARSON CORRELATION

a, b : users
r_{a,p} : rating of user a for item p
P : set of items rated both by a and b
• Possible similarity values between −1 and +1

$$ sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2}\;\sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}} $$


RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

Approximate the m × n user–item rating matrix R by a product of two low-rank matrices:

$$ R_{m \times n} \approx U_{m \times k} \, V_{k \times n} $$

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating:

$$ \hat{R}_{ij} = U_i^{\top} V_j $$

Minimize the prediction error:

$$ \min_{U,V} \; \sum_{i,j} \left( R_{ij} - U_i^{\top} V_j \right)^2 + \lambda \left( \lVert U_i \rVert^2 + \lVert V_j \rVert^2 \right) $$
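For intuition, a sketch of the stochastic-gradient alternative to L-BFGS / ALS for this same objective: after seeing rating R_ij, with error e_ij = R_ij − U_iᵀV_j and an assumed learning rate γ,

$$ U_i \leftarrow U_i + \gamma \,( e_{ij} V_j - \lambda U_i ), \qquad V_j \leftarrow V_j + \gamma \,( e_{ij} U_i - \lambda V_j ) $$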


RE METHODS CLUSTER

First cluster the users / items on their profiles and ratings; then apply knn within one subgroup to produce the predictions.

[Diagram: user/item profile + user/item rating → Clustering → knn within one subgroup → Predictions]


RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc. rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X, Y) = (# trxs containing X and Y) / (total # trxs)

Lift(X, Y) = Support(X, Y) / ( Support(X) · Support(Y) )

Support & lift example: Diapers → Beer 0.8, Diapers → Candles 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
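A quick worked example with hypothetical counts: suppose 1,000 transactions, of which 100 contain diapers, 80 contain beer, and 40 contain both. Then

Support(Diapers, Beer) = 40 / 1000 = 0.04
Lift = 0.04 / (0.10 × 0.08) = 5

so the combination occurs 5 times more often than it would if the two items were bought independently.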


METHOD ENSEMBLE

Take a linear combination of the previous methods to achieve better performance.


PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd
      factors = 20 label = "svd" fconv = 1e-3 gconv = 1e-3
      maxiter = 100 MAXFEVAL = 5000 function = L2
      lamda = 0.2 technique = lbfgs;
   RUN;

   METHOD ARM label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT
      method = svd
      label = "svd"
      Num = 3
      users = ("Longhow Lam");
run;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.


USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

[Scatter plot: predicted review score vs. given review score]

R² linear regression = 0.5
R² neural net = 0.6



IENS REVIEWS: APPLY THE MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

The first 100 digits of the MNIST data and their KNOWN labels in red.


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split. Models tried:
• PCA regression on the 50 largest PCs
• Seven single layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors


MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. The red numbers are the predicted labels; we see some obvious mistakes…


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio players: recordings of the spoken digits 1 and 2]


SPEECH RECOGNITION

WAV files consist of ~ 30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
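As an illustration of this pipeline, a minimal SAS sketch; the data set and variable names (work.wave1, amplitude, the freq: columns) are assumptions for illustration, not the code used in the presentation:

proc spectra data= work.wave1 out= work.spec1 p;   /* periodogram of one recording */
   var amplitude;
run;

/* after stacking the periodograms of all 16 recordings into work.allSpec
   (one row per recording, columns freq1-freqN), reduce the dimension */
proc princomp data= work.allSpec out= work.scores n= 5;
   var freq: ;   /* scores on the first 5 principal components end up in work.scores */
run;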


SPEECH RECOGNITION


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• retrieve the SAS faces from our site,
• put them through the Face++ API,
• collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder


STRANGE FACE DETECTION: LOOK-ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION: STRANGE FACES

[Images: SAS faces vs. actors' faces]

Read more on my blog


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

[Images: SAS faces vs. actors' faces]

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • "Classical" regression
  • Linear & logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regression & classification)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principal Components
  • Principal Components (2)
  • Principal Components (3)
  • Principal Components (4)
  • Principal Components (5)
  • Principal components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging & Boosting
  • Combine models
  • Bagging & Boosting: Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K – nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 61: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

httpswwwyoutubecomwatchv=3liCbRZPrZA

Linear not separable but in 3D space they are

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K ndash NEAREST NEIGHBOUR

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)

K – NEAREST NEIGHBOUR

K-NN METHOD

• No model is fitted. Given a query point x0, find the k points x1, x2, …, xk that are closest in distance to x0.
• Classify x0 using the majority vote among the k neighbours.

Example: among the 5 nearest neighbours of x0, 3 are red and 2 are green, so we predict x0 to be red.

K-NN METHOD

[Figure: 1-nearest-neighbour vs. 15-nearest-neighbour classification]

K-NN METHOD

Use different numbers k of nearest neighbours and compare test and training errors.

Despite its simplicity, k-nearest-neighbours has been successfully used in problems like:
• handwritten digits
• satellite image scenes
• EKG patterns
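Outside Enterprise Miner, a k-NN classifier can also be run in code with PROC DISCRIM's nonparametric method; a minimal sketch, where the data set and variable names (train, score, colour, x1, x2) are assumptions:

proc discrim data=train test=score testout=scored
             method=npar k=5;        /* 5 nearest neighbours */
   class colour;                     /* red / green target   */
   var x1 x2;
run;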

K-NN EXAMPLE: DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site. For 108K Dutch postal codes (out of 463K) there are one or more houses for sale. How can we estimate the house value for the postal codes without a house price?

For a postal code with no price: estimate the price by taking the k closest house-for-sale prices.

Comparing different nearest neighbours in SAS Enterprise Miner


K-NN EXAMPLE: DUTCH HOUSE PRICES

30% of the data was used as validation set. In Enterprise Miner different values for k were tried; k=5 nearest neighbours has the lowest average squared error.

NEURAL NETWORKS / DEEP LEARNING

NEURAL NETWORK: LINEAR REGRESSION

Y = f(X, w) = w1 + w2·X2 + w3·X3 + w4·X4

[Diagram: inputs 1, X2, X3, X4 feed a single compute node through the weights w1…w4.]

f is the so-called activation function. This could be the logit function, but other choices are possible. There are four weights w's that have to be determined.

NEURAL NETWORKS: MATHEMATICAL FORMULATION

[Diagram: inputs X1–X4 (leeftijd/age, inkomen/income, regio/region, geslacht/gender) feed a hidden layer Z1–Z3 through weights α; the hidden layer feeds the output Y through weights β.]

In formula, the prediction formula for a NN, the definitions of the functions g and σ, and the binary-classifier case are reconstructed below. The model weights α and β have to be estimated from the data.
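The formula images did not survive extraction; a standard reconstruction of the single-hidden-layer formulation, following The Elements of Statistical Learning (recommended earlier in this deck), is:

Z_m = σ(α_0m + α_mᵀ X),   m = 1, …, M
f(X) = g(β_0 + βᵀ Z)

with σ(v) = 1 / (1 + e^(−v)); in case of a binary classifier, g can be taken as the same sigmoid, g(t) = 1 / (1 + e^(−t)).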

NEURAL NETWORKS: ESTIMATING THE WEIGHTS

Back propagation algorithm:

Randomly choose small values for all wi's. For each data point (observation):
1. Calculate the neural net prediction.
2. Calculate the error E (for example E = (actual − prediction)²).
3. Adjust the weights w according to the update rule (reconstructed below).
4. Stop if the error E is small enough.
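The update-rule image is missing; the usual gradient-descent step, with learning rate r, is:

w_new = w_old − r · ∂E/∂w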

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: X1–X4 → ENCODE → DECODE → X1–X4, with a 2-dimensional middle layer for visualisation.]

A linear activation function corresponds with 2-dimensional principal components analysis.

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes: INPUT → ENCODE → DECODE → OUTPUT = INPUT.

NEURAL NET: CARS EXAMPLE

2-dimensional PCA vs. an autoencoder network 25 – 15 – 2 – 15 – 25.

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1, …, x400)
• Compare two-dimensional PCA with a neural net autoencoder (PCA baseline sketched below)
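For the PCA side of the comparison, a minimal sketch with assumed data set and pixel variable names (digits, x1–x400):

/* two principal components of the 400 pixel inputs */
proc princomp data=digits out=digits_pca n=2;
   var x1-x400;
run;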

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR            */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED       */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1 - pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

Two-dimensional representation of the 400-dimensional 'digit' data [figure].

BAYESIAN NETWORKS


BAYESIAN NETWORKS – ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data

TEXT MINING


TEXT MINING BASICS

"Advanced" word counting:

Parse & filter: part of speech, entity detection, mixed/numeric/abbreviations, stemming, spell checks, stop list, synonym list, multi-term words.

Then apply traditional data mining: clustering, prediction, machine learning.

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets."
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw."
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$."

TERM DOCUMENT MATRIX A

Terms                      Doc 1  Doc 2  Doc 3
+Fiets (znmw)                1      1      1
Fietsen (ww)                 0      1      0
Blauwe (bvg)                 0      1      0
Amsterdam (locatie)          1      0      0
+Lopen (ww)                  1      1      0
Straat (znmw)                1      0      0
Kapot (bijw)                 0      0      1
Slecht                       0      0      1
Stuk Ijzer                   0      0      1
1057DK (postcode)            1      0      0
bitlycomsdrtw (Internet)     0      1      0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A

TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term-document matrix:
• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply singular value decomposition first.

TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ Vᵀ, where Σ is diagonal with r singular values (r could be many thousands).

Take only the first k ≪ r singular values: A_k = U_k Σ_k V_kᵀ.

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
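A truncated SVD can be computed directly in SAS/IML; a sketch where the table name and dimension are assumptions (tdm, k = 300):

proc iml;
   use tdm; read all var _num_ into A; close tdm;  /* term-document matrix */
   call svd(U, Q, V, A);                 /* A = U*diag(Q)*V`              */
   k  = 300;                             /* keep the first k singular values */
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;
quit;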

TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

RECOMMENDATION ENGINE: Which product should I recommend my customers?

RECOMMENDATION ENGINE: USER–ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: 1 mln users and 100K items gives roughly 0.01% filled.

User–Item Matrix – Data

           Item 1  Item 2  Item 3  Item 4  Item 5
User 1       3       2       5       4       5
User 2       -       -       -       1       1
User 3       1       -       2       5       -
User 4       -       -       1       2       5
User 5       2       1       4       2       3
User 6       2       3       -       5       1
User 7       5       1       -       3       4
User 8       -       1       -       4       1
User 9       2       3       2       4       2
User 10      -       1       3       -       1

User 4's item ratings: -, -, 1, 2, 5. After some math… the predicted ratings for User 4 are 3.21, 4.82, 1, 2, 5, so recommend item 2.

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
Model-based algorithms: matrix factorization (SVD – LBFGS)
Market basket analysis: association rules mining (arm)
Mixture of different methods: clustering (cluster), ensemble

RE METHODS: SLOPE ONE

Item-item based: predictors of the form y = x + b, a line with slope equal to 1.

Weight wij: the number of users having rated both items i and j. Rating ruj: the average rating computed from item j.

Sample rating database:

Customer  Item A  Item B  Item C
John        5       3       2
Mark        3       4       -
Lucy        -       2       5
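A worked illustration with these numbers (not on the slide; the usual weighted slope-one prediction of Lucy's missing rating for item A):

average difference A−B = ((5−3) + (3−4)) / 2 = 0.5   (2 users rated both)
average difference A−C = (5−2) / 1 = 3               (1 user rated both)
prediction = (2·(2 + 0.5) + 1·(5 + 3)) / (2 + 1) = 13/3 ≈ 4.33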

RE METHODS: K NEAREST NEIGHBORS

The rating rui is determined by the ratings "in the neighborhood":
• How to determine the neighbors N, and how many (k) to use?
• How to compute the similarity/distance measure w? Pearson's correlation coefficient, cosine distance, other adjustments.

RE METHODS: PEARSON CORRELATION

a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between −1 and 1:

sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )

RE METHODS: K NEAREST NEIGHBORS METHOD [figure]

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data? Factor the m × n user–item matrix R into R ≈ U·V, with U of size m × k and V of size k × n (k hidden factors).

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: R̂_ij = U_iᵀ V_j. Minimize the prediction error:

min_{U,V} Σ_{i,j} (R_ij − U_iᵀ V_j)² + λ(‖U_i‖² + ‖V_j‖²)

RE METHODS: CLUSTER

First cluster the users/items on their profiles or ratings, then apply knn within one subgroup to produce the predictions.

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc. rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X, Y) = #trxs{X, Y} / #total trxs
Lift = Support(X, Y) / ( Support(X) · Support(Y) )

Support & lift examples: Diapers → Beer 0.8; Diapers → Candles 0.018. A lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
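An illustrative calculation (numbers invented for clarity): if Support(X) = 0.4, Support(Y) = 0.5 and Support(X, Y) = 0.3, then Lift = 0.3 / (0.4 · 0.5) = 1.5, i.e. buyers of X are 1.5 times more likely to also buy Y.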

METHOD: ENSEMBLE

A linear combination of the previous methods, to achieve better performance.

PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors   = 20
      label     = "svd"
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      maxfeval  = 5000
      function  = L2
      lamda     = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm / label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label  = "svd"
      num    = 3
      users  = ("Longhow Lam");
RUN;
QUIT;

LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING (compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it into the algorithm…)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces. So: can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score [figure]:
R² linear regression = 0.5
R² neural net = 0.6
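The linear benchmark can be sketched in a few lines; the data set and variable names (reviews_svd, score, svd1–svd300) are assumptions:

/* regress the review score on the 300 SVD inputs */
proc reg data=reviews_svd;
   model score = svd1-svd300;
run; quit;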

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red [figure].

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

With a 70/30 training/validation split (a split sketch follows below), the models tried were:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
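The split itself is one PROC SURVEYSELECT step; a sketch with an assumed table name (mnist):

/* 70/30 training/validation split */
proc surveyselect data=mnist out=mnist_split samprate=0.7
                  outall method=srs seed=42;
run;
/* Selected=1 -> training, Selected=0 -> validation */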

MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels; our best model predicted the label for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here [figure]. Red numbers are the predicted labels; we see some obvious mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio clips: spoken '1' and '2']

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.
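The spectral step could be done with PROC SPECTRA (SAS/ETS) followed by PCA; a sketch where the data set and variable names are assumptions (wav1, amplitude, allSpectra, f1–f500):

/* periodogram of one sampled WAV signal */
proc spectra data=wav1 out=spec1 p adjmean;
   var amplitude;
run;

/* compress the stacked spectra with principal components */
proc princomp data=allSpectra out=pcs n=10;
   var f1-f500;   /* frequency bins (assumed names) */
run;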

SPEECH RECOGNITION [figures: spectra of the recordings]

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, and collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering); see the sketch below
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder
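The look-alike step could be run as hierarchical clustering in SAS; a sketch where the landmark variable names and data set are assumptions (faces, x1–x83, y1–y83, name):

/* hierarchical clustering of the 83 facial landmarks */
proc cluster data=faces method=ward outtree=tree;
   var x1-x83 y1-y83;    /* landmark coordinates (assumed names) */
   id name;
run;

/* cut the dendrogram into 10 groups of look-alike faces */
proc tree data=tree nclusters=10 out=faceClusters;
run;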

STRANGE FACE DETECTION: LOOK-ALIKE FACES [figure]

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES [figure]

STRANGE FACE DETECTION: STRANGE FACES

SAS faces vs. actors' faces [figure]. Read more on my blog.

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 63: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

bull No model is fitted Given a query point x0 find the k points x1 x2 xk that are

closest in distance to x0bull Classify x0 using the majority vote among the k neighbours

x05 nearest neighbours of x0

3 of them are red 2 of them are green so we predict x0 to be red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 64: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

1 nearest neighbour 15 nearest neighbour

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:

R² linear regression = 0.5
R² neural net = 0.6
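That pipeline (reviews, then term weights, then a 300-dimensional SVD space, then regression) can be approximated outside SAS Text Miner; the following Python/scikit-learn sketch uses made-up reviews and scores, and caps the SVD dimension for the toy vocabulary:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hypothetical stand-ins for the 16,000 scored IENS reviews.
reviews = ["great food, friendly service", "terrible food, long wait"] * 200
scores  = np.array([9.0, 3.5] * 200)

X = TfidfVectorizer().fit_transform(reviews)            # term-document matrix
k = min(300, X.shape[1] - 1)                            # 300 SVD inputs on real data
X_svd = TruncatedSVD(n_components=k).fit_transform(X)   # reviews as points in SVD space

Xtr, Xte, ytr, yte = train_test_split(X_svd, scores, test_size=0.3, random_state=1)
for model in (LinearRegression(),
              MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000, random_state=1)):
    pred = model.fit(Xtr, ytr).predict(Xte)
    print(type(model).__name__, "R2 =", round(r2_score(yte, pred), 2))
```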

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS
MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data, with their KNOWN labels in red.

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split; models tried:

PCA regression on the 50 largest PCs
Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
Seven multi-layer neural nets
Three random forests: 100, 500 and 1000 trees
8, 16 and 24 nearest neighbors

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
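As a rough stand-in for that Enterprise Miner comparison, here is a k-NN benchmark in Python/scikit-learn on its small bundled digits set (8 by 8 pixels, not the 42,000-image MNIST file used in the deck):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)   # 8x8 digit images flattened to 64 inputs
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (8, 16, 24):                 # the same k values as on the slide
    acc = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr).score(Xva, yva)
    print(f"{k}-NN misclassification rate: {1 - acc:.1%}")
```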

MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we see some obvious mistakes...

SPEECH RECOGNITION
DIGITS RECORDED WITH IPHONE

[audio players: recordings of the spoken digits '1' and '2']

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain
Still too much: apply principal components

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files
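A minimal sketch of that feature pipeline (Python/NumPy and scikit-learn, with random arrays standing in for the 16 recorded WAV files):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
waves = rng.normal(size=(16, 30_000))   # hypothetical stand-ins for the recordings

# Spectral analysis: magnitude spectrum via FFT (positive frequencies only).
spectra = np.abs(np.fft.rfft(waves, axis=1))

# Still too many dimensions: keep a handful of principal components.
features = PCA(n_components=5).fit_transform(spectra)
print(features.shape)                   # (16, 5) -> inputs for a small neural net
```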

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data
Zero errors on the test data (also 8 'ones' and 8 'twos')

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Little joke on my colleagues...

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to: retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
Which faces are look-alikes? proc cluster (hierarchical cluster)
Sales faces? Predictive modeling / machine learning
Who is the Brad Pitt? Nearest neighbor
Strange faces? proc neural auto-encoder
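The 'strange faces' step scores each face by how badly an auto-encoder reconstructs its landmark vector. A hedged Python sketch of that idea, with MLPRegressor standing in for proc neural and random numbers standing in for the 83 Face++ landmarks:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 166))   # hypothetical 83 (x, y) landmarks per face

# Auto-encoder: train the network to reproduce its own inputs
# through a narrow hidden layer.
ae = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=7).fit(X, X)

recon_error = ((X - ae.predict(X)) ** 2).mean(axis=1)
strangest = np.argsort(recon_error)[-5:]   # faces reconstructed worst
print("strangest faces:", strangest)
```

Faces with the largest reconstruction error are the candidates for 'strange'.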

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

SAS faces vs. actors' faces. Read more on my blog.

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 65: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN METHOD

Use different numbers k of nearest neighbours test and traning errors

Despite its simplicity k-nearest-neighbors has been successful used in problems like bull handwritten digits bull Satellite image scenes bull EKG patterns

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 66: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house for sale prices from a Dutch housing site For 108K Dutch postal codes (out of 463K) there are one or more houses for sale How can we estimate the house value for the postal codes without a house price

For a Postal code with no price estimate the price by taking the k closest house for sale prices

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 67: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Comparing different nearest neighbours in SAS Enterprise Miner

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

Approximate the m × n rating matrix R (rows: users, columns: items) by the product of two low-rank matrices: R ≈ U V, with U of size m × k and V of size k × n.

Select a loss function (squared error) and select the number of hidden factors k.

Predict a new rating: R̂_ij = U_iᵀ V_j

Minimize the prediction error: min_{U,V} Σ_{i,j} (R_ij − U_iᵀ V_j)² + λ ( ‖U_i‖² + ‖V_j‖² )

Optimization problem: L-BFGS, ALS.
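A minimal sketch that minimizes exactly this penalized squared error, but with plain stochastic gradient descent (PROC RECOMMEND uses L-BFGS or ALS instead; the hyperparameters here are arbitrary):

import numpy as np

def factorize(R, k=2, lam=0.2, lr=0.01, epochs=2000, seed=1):
    """min over U, V of sum (R_ij - U_i'V_j)^2 + lam * (|U_i|^2 + |V_j|^2)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.standard_normal((m, k))
    V = 0.1 * rng.standard_normal((n, k))
    obs = [(i, j) for i in range(m) for j in range(n) if not np.isnan(R[i, j])]
    for _ in range(epochs):
        for i, j in obs:
            e = R[i, j] - U[i] @ V[j]       # prediction error on one rating
            Ui = U[i].copy()
            U[i] += lr * (e * V[j] - lam * U[i])
            V[j] += lr * (e * Ui - lam * V[j])
    return U, V

# Users 4, 5 and 7 from the user-item matrix shown earlier:
R = np.array([[np.nan, np.nan, 1., 2., 5.],
              [2., 1., 4., 2., 3.],
              [5., 1., np.nan, 3., 4.]])
U, V = factorize(R)
print(np.round(U @ V.T, 2))  # R-hat: the filled-in rating matrix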

RE METHODS CLUSTER

First cluster the users/items on their profiles and ratings; then apply k-NN within one subgroup (cluster) to generate the predictions.

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting; use 'support' and 'lift' to judge the importance of a rule:

Support(X, Y) = (# trxs with X and Y) / (total # trxs)
Lift(X → Y) = Support(X, Y) / ( Support(X) · Support(Y) )

Support & lift example: Diapers → Beer 0.8; Diapers → Candles 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X. (A small sketch of these two measures follows below.)
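A minimal Python sketch of support and lift on a made-up transaction list:

transactions = [{"diapers", "beer"}, {"diapers", "beer", "milk"},
                {"diapers", "candles"}, {"beer", "milk"}, {"diapers", "beer"}]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(X, Y):
    return support(X | Y) / (support(X) * support(Y))

print(support({"diapers", "beer"}))           # 0.6
print(round(lift({"diapers"}, {"beer"}), 2))  # 0.94: barely above independence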

METHOD ENSEMBLE

Take a linear combination of the previous methods to achieve better performance.
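For instance, blending the predicted ratings of several recommenders (a sketch with made-up numbers; the weights would typically be tuned on a validation set):

import numpy as np

# Hypothetical predictions of three methods for the same user-item pairs:
pred_knn = np.array([3.1, 4.6, 2.0])
pred_svd = np.array([3.5, 4.9, 1.7])
pred_arm = np.array([3.0, 4.2, 2.4])

weights = [0.5, 0.3, 0.2]  # e.g. chosen on a validation set
ensemble = np.average([pred_knn, pred_svd, pred_arm], axis=0, weights=weights)
print(np.round(ensemble, 2))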

PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD (L-BFGS) with 20 factors */
   METHOD svd
      factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100
      MAXFEVAL = 5000 function = L2 lamda = 0.2 technique = lbfgs;
   RUN;

   METHOD ARM label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT method = svd label = svd Num = 3 users = ("Longhow Lam");
run;
QUIT;

LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it into the algorithm…)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score: R² linear regression = 0.5, R² neural net = 0.6.

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split. Models tried:

• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified. (A rough open-source re-creation of such a benchmark is sketched below.)
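A sketch with scikit-learn (my re-creation, not the Enterprise Miner flow; fetch_openml downloads the data on first use, and 8-NN on 29,400 training images is slow):

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_tr, X_va, y_tr, y_va = train_test_split(X[:42000], y[:42000],
                                          train_size=0.7, random_state=1)
knn = KNeighborsClassifier(n_neighbors=8).fit(X_tr, y_tr)
print(1 - knn.score(X_va, y_va))  # validation misclassification rate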

MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels; we see some obvious mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files. (A sketch of this feature pipeline follows below.)
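A minimal numpy/scipy sketch of the pipeline, assuming 16 mono recordings with hypothetical file names:

import numpy as np
from scipy.io import wavfile

files = [f"one_{i}.wav" for i in range(1, 9)] + [f"two_{i}.wav" for i in range(1, 9)]

N = 30000  # pad/truncate each clip to ~30,000 samples so the spectra align
spectra = []
for f in files:
    rate, x = wavfile.read(f)
    x = np.pad(x.astype(float), (0, max(0, N - len(x))))[:N]
    spectra.append(np.abs(np.fft.rfft(x)))  # magnitude spectrum (frequency domain)

X = np.vstack(spectra)
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt[:5].T  # first few principal components: inputs for the classifier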

SPEECH RECOGNITION

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • "Classical" regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regression & classification)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K – nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 68: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

K-NN EXAMPLE DUTCH HOUSE PRICES

30 of the data was used as validation set In Enterprise Miner different values for k were used k=5 nearest neighboor has the lowest Average squared error

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 69: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 70: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKSDEEP LEARNING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORK LINEAR REGRESSION

f Y = f(Xw) = w1 + w2X2 + w3X3 + w4X41

X2

X3

X4w4

w3

w1

w2 Neural network compute nodef is the so-called activation function This could be the logit function but other choices are possible

There are four weights wrsquos that have to be determined

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS MATHEMATICAL FORMULATION

In formula the prediction forumla for a NN is geiven by

Leeftijd

Inkomen

Regio

Geslacht

X1

X2

X3

X4

Z1

Z2

Z3

Y

N

X inputs Hidden layer z outputs

α1

β1

De functions g and σ are defined as

In case of a binary classifier

The model weights α and β have to be estimated from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all wirsquo s For each data point (observation)

1 Calculate the neural net prediction2 Calculate the error E (for example E = (actual ndash prediction)2)3 Adjust weights w according to

4 Stop if error E is small enough

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING NEURAL NET WORK WITH MORE THAN 2 HIDDEN LAYERS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble


RE METHODS SLOPE ONE

Item-item based: y = x + b, with slope equal to 1.

Weight w_ij: the number of users having rated both items i and j
Rating r̄_j: the average rating computed for item j

Sample rating database:

Customer   Item A   Item B   Item C
John          5        3        2
Mark          3        4        -
Lucy          -        2        5
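As a worked example on the table above (weighted slope one, my own arithmetic): the average deviation C − A (only John rated both) is 2 − 5 = −3, and the average deviation C − B (John and Lucy) is ((2 − 3) + (5 − 2)) / 2 = 1. Mark's predicted rating for item C is then ((3 − 3)·1 + (4 + 1)·2) / (1 + 2) = 10/3 ≈ 3.3.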


RE METHODS K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

How to determine the neighbors, and how many (k) to use?

How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

[Figure: neighbors N of a user, with similarity weights w]


RE METHODS

PEARSON CORRELATION

a, b: users
r_{a,p}: rating of user a for item p
P: set of items rated both by a and b
• Possible similarity values between −1 and 1

$$ sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2}\,\sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}} $$
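A minimal sketch (not from the slides; the dataset and variable names are hypothetical) of computing this similarity for two users with Base SAS PROC CORR, over their co-rated items only:

data ratings;                      /* items rated by both user_a and user_b */
   input item $ user_a user_b;
   datalines;
A 5 3
B 3 4
C 2 5
;
run;

proc corr data=ratings pearson;    /* Pearson similarity sim(a,b) */
   var user_a user_b;
run;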


RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

Factorize the m × n user-item rating matrix R into R ≈ U V, with U an m × k matrix (users) and V a k × n matrix (items):

• Select loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict new rating: $\hat{R}_{ij} = U_i^T V_j$

Minimize prediction error:

$$ \min_{U,V} \sum_{i,j} \left( R_{ij} - U_i^T V_j \right)^2 + \lambda \left( \lVert U_i \rVert^2 + \lVert V_j \rVert^2 \right) $$
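A minimal sketch (not from the slides; assumes SAS/IML is available) of this factorization, fitted with plain gradient descent on the observed cells of a toy rating matrix:

proc iml;
   /* toy user-item rating matrix, . = missing */
   R = {5 3 2 .,
        3 4 . 1,
        . 2 5 4};
   m = nrow(R);  n = ncol(R);
   k = 2;                                    /* number of hidden factors */
   call randseed(123);
   U = j(m, k);  call randgen(U, "Uniform");  U = U / 10;   /* m x k user factors */
   V = j(k, n);  call randgen(V, "Uniform");  V = V / 10;   /* k x n item factors */
   eta = 0.05;  lambda = 0.2;                /* step size, regularization */
   do iter = 1 to 500;
      do i = 1 to m;
         do j2 = 1 to n;
            if R[i, j2] ^= . then do;        /* update on observed ratings only */
               e = R[i, j2] - U[i, ] * V[, j2];
               U[i, ]  = U[i, ]  + eta * (e * V[, j2]` - lambda * U[i, ]);
               V[, j2] = V[, j2] + eta * (e * U[i, ]` - lambda * V[, j2]);
            end;
         end;
      end;
   end;
   Rhat = U * V;                             /* predictions, incl. the missing cells */
   print Rhat;
quit;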


RE METHODS CLUSTER

First cluster the users/items on their profiles and ratings; then apply kNN within one subgroup to generate the predictions.


RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc. rules mining: identify frequent itemsets (rules) in the transaction data:

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X,Y) = (# trxs with X, Y) / (total # trxs)

Lift = Support(X,Y) / (Support(X) · Support(Y))

Support & lift: Diapers → Beer 0.8; Diapers → Candles 0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
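As a quick check with made-up numbers: if 2% of all transactions contain both X and Y (Support(X,Y) = 0.02), 10% contain X and 8% contain Y, then lift = 0.02 / (0.10 · 0.08) = 2.5.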


METHOD ENSEMBLE

Linear combination of the previous methods, to achieve better performance.
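For instance (illustrative weights, not from the slides): r̂_ui = 0.7 · r̂_ui(svd) + 0.3 · r̂_ui(knn).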


PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm / label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      Num = 3
      users = ("Longhow Lam");
   run;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (You are rejected. The computer says NO.)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.


Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R² linear regression = 0.5
R² neural net = 0.6



IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS
MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split. Techniques tried:

• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.


MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here. Red numbers are predicted labels. We see some obvious mistakes…


SPEECH RECOGNITION
DIGITS RECORDED WITH IPHONE



SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.
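A minimal sketch of the spectral step (not from the slides; WORK.WAVPOINTS and the column SIGNAL are hypothetical names for the imported WAV samples), using PROC SPECTRA from SAS/ETS:

proc spectra data=wavpoints out=freqdom p s;   /* periodogram and spectral density */
   var signal;                                 /* the sampled waveform */
run;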


SPEECH RECOGNITION


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Little joke on my colleagues…


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• Retrieve the SAS faces from our site
• Put them through the Face++ API
• Collect the JSON results and store them in an ABT

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical cluster)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural (auto-encoder)


STRANGE FACE DETECTION: LOOK-ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-ALIKES


STRANGE FACE DETECTION: STRANGE FACES

[Figure: SAS faces vs. actors' faces]

Read more on my blog



Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 74: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

DEEP LEARNING: NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS


NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs.

[Diagram: inputs X1...X4 are encoded into a small middle layer and decoded back to outputs X1...X4]

A linear activation function corresponds to 2-dimensional principal components analysis.
A 2-dimensional middle layer can be used for visualisation.

NEURAL NETS: AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Often more hidden layers with many nodes.

[Diagram: INPUT -> ENCODE -> DECODE -> OUTPUT = INPUT]

NEURAL NET: CARS EXAMPLE

2-dimensional PCA vs. an autoencoder network 25 - 15 - 2 - 15 - 25.

NEURAL NETS: AUTOENCODER EXAMPLE

• 1000 images of digits
• Each image has 400 pixels
• So a 400-dimensional input vector X = (x1,...,x400)
• Compare two-dimensional PCA with a neural net autoencoder

NEURAL NETS: AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH, COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS - SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;
   hidden 100 / id= h2;
   hidden 2   / id= h3 act= linear;
   hidden 100 / id= h4;
   hidden 300 / id= h5;
   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1 - pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING, WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;

Two-dimensional representation of the 400-dimensional 'digit' data.


BAYESIAN NETWORKS


BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data


TEXT MINING

TEXT MINING BASICS

"Advanced" word counting.

Parse & Filter:
• Part of speech
• Entity detection
• Mixed / numeric / abbrev.
• Stemming
• Spell checks
• Stop list
• Synonym list
• Multi-term words

Apply traditional data mining:
• Clustering
• Prediction / machine learning

TEXT MINING BASICS

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets"
  [I walk down the street in Amsterdam, 1057DK, with my bike]
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw"
  [She did not walk but cycled on her blue bike; note the misspelled "fieets" and the link]
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$"
  [My two-wheeler is broken, what a bad piece of iron, $$]

Terms                        Doc 1   Doc 2   Doc 3
+Fiets (znmw)                  1       1       1
Fietsen (ww)                   0       1       0
Blauwe (bvg)                   0       1       0
Amsterdam (locatie)            1       0       0
+Lopen (ww)                    1       1       0
Straat (znmw)                  1       0       0
Kapot (bijw)                   0       0       1
Slecht                         0       0       1
Stuk Ijzer                     0       0       1
1057DK (postcode)              1       0       0
bitlycomsdrtw (Internet)       0       1       0

TERM-DOCUMENT MATRIX A
• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A

TEXT MINING: TERM-DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term-document matrix:

• Often more terms than documents
• Rows could be strongly correlated
• Matrix is often very sparse

Apply singular value decomposition first.

TEXT MINING: SVD ON THE TERM-DOCUMENT MATRIX A

Matrix SVD decomposition: A = U Σ Vᵀ, where Σ is a diagonal matrix holding the r singular values [ r could be many thousands ].

Take only the first k << r singular values: Aₖ = Uₖ Σₖ Vₖᵀ.

A document d is then no longer a long vector of m word counts but a much shorter vector, say of length 300.
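To make the truncation concrete, here is a minimal SAS/IML sketch (the data set name work.tdm and k = 2 are assumptions for illustration, not from the deck):

proc iml;
   /* hypothetical term-document data set: rows = terms, columns = documents */
   use work.tdm;
   read all var _NUM_ into A;
   close work.tdm;

   /* full decomposition: A = U * diag(Q) * V` */
   call svd(U, Q, V, A);

   /* keep only the first k << r singular values */
   k = 2;
   Ak = U[, 1:k] * diag(Q[1:k]) * V[, 1:k]`;

   /* each document (a column of A) becomes a short k-dimensional vector */
   docs = V[, 1:k];
   print docs;
quit;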

TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud). Apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics).

[Diagram: documents grouped into Topic 1, Topic 2, Topic 3]

RECOMMENDATION ENGINE: Which product should I recommend to my customers?

RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rate items (products) explicitly. The matrix is often very sparse: 1 mln users x 100K items gives ~ 0.01% filled.

User-Item Matrix - Data

           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings:                -      -      1      2      5
After some math, the predictions are: 3.21   4.82   1      2      5

Recommend item 2.

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: Slope one (slope1), K nearest neighbors (knn)
• Model-based algorithms: Matrix factorization (SVD - LBFGS)
• Market basket analysis: Association rules mining (arm)
• Mixture of different methods: Clustering (cluster), Ensemble

RE METHODS: SLOPE ONE

Item-item based: y = x + b, with slope equal to 1.

Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the average rating computed from item j.

Sample rating database:

Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
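Worked through on this sample database (a sketch of the slope-one logic, not PROC RECOMMEND output), predicting Lucy's rating for item A:

• Average difference A - B = ((5-3) + (3-4)) / 2 = 0.5, based on 2 users; via item B: 2 + 0.5 = 2.5
• Average difference A - C = (5-2) / 1 = 3, based on 1 user; via item C: 5 + 3 = 8
• Weighted prediction: (2 x 2.5 + 1 x 8) / (2 + 1) ≈ 4.33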

RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood".

• How to determine the neighbors N, and how many (k) to use?
• How to compute the similarity/distance measure w?
  • Pearson's correlation coefficient
  • Cosine distance
  • Other adjustments
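One standard neighborhood prediction (a sketch; not necessarily the exact variant PROC RECOMMEND implements) combines the neighbors' mean-centered ratings:

r̂_ui = r̄_u + [ Σ_{v∈N} w_uv (r_vi - r̄_v) ] / [ Σ_{v∈N} |w_uv| ]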

RE METHODS: PEARSON CORRELATION

a, b : users
r_ap : rating of user a for item p
P : set of items rated both by a and b
• Possible similarity values between -1 and 1

sim(a,b) = [ Σ_{p∈P} (r_ap - r̄_a)(r_bp - r̄_b) ] / [ √( Σ_{p∈P} (r_ap - r̄_a)² ) · √( Σ_{p∈P} (r_bp - r̄_b)² ) ]
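A quick check with invented ratings: if a and b share P = {p1, p2, p3} with r_a = (5, 3, 4) and r_b = (4, 2, 3), then r̄_a = 4 and r̄_b = 3, the numerator is (1)(1) + (-1)(-1) + (0)(0) = 2, the denominator is √2 · √2 = 2, so sim(a,b) = 1: perfectly parallel tastes.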

RE METHODS: K NEAREST NEIGHBORS METHOD

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

[Diagram: the m x n rating matrix R (users x items) is factorized as R ≈ U V, with U an m x k and V a k x n matrix]

• Select loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict new rating: R̂_ij = U_iᵀ V_j

Minimize prediction error: min_{U,V} Σ_{i,j} ( R_ij - U_iᵀ V_j )² + λ( ‖U_i‖² + ‖V_j‖² )
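The deck solves this with L-BFGS or ALS inside PROC RECOMMEND; purely as a toy illustration of the same objective, here is a stochastic-gradient sketch in SAS/IML (ratings, step size and λ are invented):

proc iml;
   /* toy 5 x 4 rating matrix; 0 marks a missing rating (invented data) */
   R = {5 3 0 1,
        4 0 0 1,
        1 1 0 5,
        1 0 0 4,
        0 1 5 4};
   m = nrow(R);  n = ncol(R);  k = 2;        /* k hidden factors */
   call randseed(123);
   U = j(m, k);  call randgen(U, "Uniform"); /* m x k user factors */
   V = j(k, n);  call randgen(V, "Uniform"); /* k x n item factors */
   lambda = 0.02;  eta = 0.01;               /* regularization and step size */
   do iter = 1 to 2000;
      do i = 1 to m;
         do j2 = 1 to n;
            if R[i, j2] > 0 then do;
               e = R[i, j2] - U[i, ] * V[, j2];   /* prediction error */
               U[i, ]  = U[i, ]  + eta * (e * V[, j2]` - lambda * U[i, ]);
               V[, j2] = V[, j2] + eta * (e * U[i, ]` - lambda * V[, j2]);
            end;
         end;
      end;
   end;
   Rhat = U * V;                              /* filled-in rating matrix */
   print Rhat[format = 6.2];
quit;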

RE METHODS: CLUSTER

[Diagram: cluster users/items on their profiles first, then apply knn to the user/item ratings within one subgroup to generate predictions]

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining:
• Identify frequent itemsets (rules) in the transaction data:
  IF item A and B THEN item C
  IF item X THEN item Y
• Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X,Y) = (# trxs with X, Y) / (total # trxs)
Lift = Support(X,Y) / ( Support(X) x Support(Y) )

Support & Lift examples:
Diapers -> Beer: 0.8
Diapers -> Candles: 0.018

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
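A quick worked example with invented numbers: if 10% of transactions contain both X and Y, while 20% contain X and 20% contain Y, then Lift = 0.10 / (0.20 x 0.20) = 2.5, i.e. buyers of X are 2.5 times more likely than average to buy Y.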

RE METHOD: ENSEMBLE

• Linear combination of the previous methods
• Achieve better performance

PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD, L-BFGS, with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      MAXFEVAL = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD ARM /
      label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      Num = 3
      users = ("Longhow Lam");
run;
QUIT;

LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm...)
• Interactions often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

• Text mining
• Image recognition
• Sound recognition
• Strange faces

So, can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter the reviews, and to transform them to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5
R² neural net = 0.6

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their known labels in red.

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split:
• PCA regression on the 50 largest PC's
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

MNIST DATA: APPLY MODEL ON THE TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here. Red numbers are the predicted labels. We see some obvious mistakes...

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Audio players: recordings of the spoken digits 1 and 2]
SPEECH RECOGNITION

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.

WAV files consist of ~30,000 points: too much redundancy.
Use spectral analysis to convert the signal to the frequency domain.
Still too much: apply principal components.
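A hedged sketch of that pipeline (the deck shows no code here; the data set and variable names below are assumptions):

/* periodogram of one recording; 'signal' holds the ~30,000 samples */
proc spectra data = work.wav_one out = work.freq_one p adjmean;
   var signal;
run;

/* after stacking the periodograms of all 16 recordings into work.spectra,
   compress them to a few principal components */
proc princomp data = work.spectra out = work.pcs n = 5;
   var p_1 - p_500;   /* hypothetical: first 500 periodogram ordinates */
run;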


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data.
Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues...

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• Retrieve the SAS faces from our site
• Put them through the Face++ API
• Collect the JSON results and store them in an ABT

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt look-alike? Nearest neighbor
• Strange faces? proc neural auto-encoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES

STRANGE FACE DETECTION: STRANGE FACES

SAS faces vs. actors' faces.

Read more on my blog.

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS faces vs. actors' faces.

Read more on my blog.

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 75: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Neural networks that use inputs to predict the inputs

X1

X2

X3

X4

X1

X2

X3

X4

ENCODE DECODE

Linear activation function corresponds with 2 dimensional principle components analysis

2 dimensional middle layerFor visualisation

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 76: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODERS

httpsupportsascomresourcespapersproceedings14SAS313-2014pdf

Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NET CARS EXAMPLE

2 dimensional PCAAutoencoder network 25 ndash 15 ndash 2 ndash 15 ndash 25

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLE

bull 1000 images of digitsbull Each image has 400 pixelsbull So a 400 dimensional input vector X = (x1hellipx400)bull Compare two dimensional PCA with an neural net auto encoder

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

NEURAL NETS AUTOENCODER EXAMPLEproc neural

data= autoencoderTraining dmdbcat= workautoencoderTrainingCat performance compile details cpucount= 12 threads= yes

DEFAULTS ACT= TANH COMBINE= LINEAR IDS ARE USED AS LAYER INDICATORS ndash SEE FIGURE 6 INPUTS AND TARGETS SHOULD BE STANDARDIZED

archi MLP hidden= 5 hidden 300 id= h1 hidden 100 id= h2 hidden 2 id= h3 act= linear hidden 100 id= h4 hidden 300 id= h5 input corruptedPixel1 - corruptedPixel400 id= i level= int std= std target pixel1-pixel400 act= identity id= t level= int std= std BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM initial random= 123 prelim 10 preiter= 10

run

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Two dimensional representation of 400 dimensial lsquodigitrsquo data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")

WHY SAS FOR MACHINE LEARNING?

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5
R² neural net = 0.6

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS: MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.
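The 42,000/28,000 train/test split matches the Kaggle digit-recognizer CSVs (one label column plus pixel0-pixel783); assuming a file in that layout, a minimal import sketch (the path is hypothetical):

proc import datafile="/data/train.csv" out=mnist dbms=csv replace;
   getnames=yes;
run;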

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest-neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
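A hedged sketch of the 8-nearest-neighbour classifier in SAS (PROC DISCRIM with METHOD=NPAR; whether the deck ran it on raw pixels or on components is not stated, so here it runs on the first 50 PCs to keep it small; the data set and the partition variable are hypothetical):

proc princomp data=mnist out=mnistPC n=50 noprint;
   var pixel0-pixel783;
run;

proc discrim data=mnistPC(where=(part="train"))
             testdata=mnistPC(where=(part="valid"))
             method=npar k=8 testlist;
   class label;
   var Prin1-Prin50;
run;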

MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits are displayed here together with the handwritten digits. Red numbers are the predicted labels. We see some obvious mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Embedded audio players: spoken '1' and '2' samples]

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.
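This pipeline can be sketched with PROC SPECTRA (SAS/ETS) for the periodogram and PROC PRINCOMP for the reduction; the data set and variable names here are hypothetical, not from the deck:

/* periodogram of one recording's amplitude signal */
proc spectra data=wav1 out=spec1 p;
   var amplitude;
run;

/* one row per recording, columns = periodogram ordinates */
proc princomp data=allSpectra out=pcs n=10;
   var p_1-p_512;
run;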

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• Retrieve the SAS faces from our site and put them through the Face++ API
• Collect the JSON results and store them in an ABT

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder
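For the look-alike question, a hedged PROC CLUSTER sketch on the landmark ABT (the deck only names the procedure; the data set and variable names are made up):

proc cluster data=faceABT method=ward outtree=faceTree noprint;
   var x1-x83 y1-y83;   /* landmark coordinates per face */
   id faceName;
run;

proc tree data=faceTree nclusters=10 out=faceClusters noprint;
   id faceName;
run;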

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-ALIKES

STRANGE FACE DETECTION: STRANGE FACES

[Figure panels: SAS faces, actors' faces]

Read more on my blog.






[Figure: two-dimensional representation of the 400-dimensional 'digit' data]


BAYESIAN NETWORKS


BAYESIAN NETWORKS: ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables
• Links between nodes represent conditional dependencies
• Conditional probability tables are derived from the training data for each node
• Random variables are typically binary or discrete
• The graph structure can be learned from the data
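In other words (a standard property of Bayesian networks, added here for completeness), the network encodes the joint distribution as the product of the per-node conditional probability tables:

P(X1, ..., Xn) = ∏i P(Xi | Parents(Xi))

so scoring a record only requires looking up each node's table given the values of its parents.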


TEXT MINING


TEXT MINING BASICS

"Advanced" word counting.

Parse & filter:
• Part-of-speech tagging
• Entity detection
• Mixed / numeric / abbreviations
• Stemming
• Spell checks
• Stop list
• Synonym list
• Multi-term words

Apply traditional data mining:
• Clustering
• Prediction / machine learning


TEXT MINING BASICS

Three Dutch example documents:

Document 1: "Ik loop over straat in Amsterdam 1057DK met mijn fiets."
Document 2: "Zij liep niet maar fietste met haar blauwe fieets bitlycomsdrtw."
Document 3: "Mijn tweewieler is kapot wat een slecht stuk ijzer $$."

TERM DOCUMENT MATRIX A:

Terms                        Doc 1   Doc 2   Doc 3
+Fiets (znmw)                  1       1       1
Fietsen (ww)                   0       1       0
Blauwe (bvg)                   0       1       0
Amsterdam (locatie)            1       0       0
+Lopen (ww)                    1       1       0
Straat (znmw)                  1       0       0
Kapot (bijw)                   0       0       1
Slecht                         0       0       1
Stuk Ijzer                     0       0       1
1057DK (postcode)              1       0       0
bitlycomsdrtw (Internet)       0       1       0

• Each text document is a (very) long vector of word counts (often with many zeros)
• Apply further mining on this matrix A


TEXT MINING: TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term-document matrix:

• Often more terms than documents
• Rows could be strongly correlated
• The matrix is often very sparse

Apply a singular value decomposition first.


TEXT MINING: SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition:

A = U Σ Vᵀ

with Σ diagonal, holding the r singular values [could be many thousands].

Take only the first k << r singular values:

A_k = U_k Σ_k V_kᵀ

A document d is then not a long vector of m word counts but a much shorter vector, say of length 300.
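A compact way to see this at work (a sketch using SAS/IML's SVD call on a toy term-document matrix; the numbers are made up for illustration):

proc iml;
   /* toy matrix: 5 terms (rows) x 3 documents (columns) */
   A = {1 1 1,
        0 1 0,
        1 0 0,
        1 1 0,
        0 0 1};
   call svd(U, q, V, A);                        /* A = U * diag(q) * V` */
   k  = 2;                                      /* keep k << r singular values */
   Ak = U[, 1:k] * diag(q[1:k]) * V[, 1:k]`;    /* rank-k approximation of A */
   docs = V[, 1:k];                             /* each document as a k-vector */
   print q, Ak, docs;
quit;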


TEXT MINING APPLICATIONS

Combine customers' structured data and unstructured data to better predict behaviour (churn, fraud): apply machine learning to create a model f to predict the target.

Automatically generate topics within large document collections: apply clustering techniques to classify documents into clusters (topics), e.g. Topic 1, Topic 2, Topic 3.


RECOMMENDATION ENGINE: Which product should I recommend to my customers?


RECOMMENDATION ENGINE: USER-ITEM MATRIX, EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly. The matrix is often very sparse: with 1 mln users and 100K items, on the order of 0.01% of the cells are filled.

User-Item Matrix - Data:

           Item 1   Item 2   Item 3   Item 4   Item 5
User 1       3        2        5        4        5
User 2       -        -        -        1        1
User 3       1        -        2        5        -
User 4       -        -        1        2        5
User 5       2        1        4        2        3
User 6       2        3        -        5        1
User 7       5        1        -        3        4
User 8       -        1        -        4        1
User 9       2        3        2        4        2
User 10      -        1        3        -        1

User 4's item ratings:             -      -      1      2      5
After some math, the predictions:  3.21   4.82   1      2      5

Recommend item 2 (the highest predicted rating among User 4's unrated items).


RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

• Memory-based algorithms: slope one (slope1), k nearest neighbors (knn)
• Model-based algorithms: matrix factorization (SVD - LBFGS)
• Market basket analysis: association rules mining (arm)
• Mixture of different methods: clustering (cluster), ensemble


RE METHODS: SLOPE ONE

Item-item based: y = x + b, a regression with slope equal to 1.

Weight w_ij: the number of users having rated both items i and j.
Rating r̄_j: the average rating computed for item j.

Sample rating database (the placement of the missing cells is an assumption; the transcript lists only the values):

Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
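As a worked illustration under this reading of the table: to predict Lucy's rating for item A, the average difference between items A and B over the users who rated both is ((5 - 3) + (3 - 4)) / 2 = 0.5, so slope one predicts r(Lucy, A) ≈ r(Lucy, B) + 0.5 = 2.5.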


RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood" (similarity w, neighbors N).

How to determine the neighbors, and how many (k) to use?

How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments


RE METHODS: PEARSON CORRELATION

a, b: users
r_a,p: rating of user a for item p
P: set of items rated both by a and b
• Possible similarity values between -1 and 1

sim(a, b) = [ Σ_p∈P (r_a,p − r̄_a)(r_b,p − r̄_b) ] / [ √( Σ_p∈P (r_a,p − r̄_a)² ) · √( Σ_p∈P (r_b,p − r̄_b)² ) ]
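The similarity then feeds a standard k-NN prediction step (this formula is the common textbook form, added for completeness rather than taken from the slide):

pred(u, i) = r̄_u + [ Σ_v∈N sim(u, v) · (r_v,i − r̄_v) ] / [ Σ_v∈N |sim(u, v)| ]

where N is the set of the k most similar users who rated item i.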


RE METHODS: K NEAREST NEIGHBORS METHOD


RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data?

Approximate the m × n users × items matrix R by the product of two low-rank matrices:

R (m × n) ≈ U (m × k) · V (k × n)

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: R̂_ij = U_iᵀ V_j

Minimize the prediction error:

min_{U,V} Σ_{i,j} ( R_ij − U_iᵀ V_j )² + λ ( ‖U_i‖² + ‖V_j‖² )


RE METHODS: CLUSTER

First cluster on the user/item profiles and user/item ratings; then produce predictions with k-NN within one subgroup.


RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X → Y) = (# transactions with X and Y) / (total # transactions)

Lift(X → Y) = Support(X, Y) / ( Support(X) · Support(Y) )

Support & lift examples (decimal points reconstructed from the transcript): Diapers → Beer: 0.8; Diapers → Candles: 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
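A quick numeric check of that statement (an illustration, not from the slide): with Support(X) = 0.2, Support(Y) = 0.16 and Support(X, Y) = 0.08, Lift = 0.08 / (0.2 × 0.16) = 2.5, i.e. Y occurs 2.5 times more often among transactions with X than independence would predict.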


METHOD: ENSEMBLE

A linear combination of the previous methods, to achieve better performance.


PROC RECOMMEND recom = rs.IENS;

   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD, L-BFGS, with 20 factors */
   METHOD svd /
      factors   = 20
      label     = svd
      fconv     = 1e-3
      gconv     = 1e-3
      maxiter   = 100
      MAXFEVAL  = 5000
      function  = L2
      lamda     = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm / label = ARM;
   RUN;

   /* Information on the recommender system */
   INFO;
QUIT;
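Note how the METHOD options map onto the factorization formula above: factors = 20 is the number of hidden factors k, lamda = 0.2 is the regularization weight λ, function = L2 selects the squared-error loss, and technique = lbfgs picks the L-BFGS optimizer.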


/* Prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label  = svd
      Num    = 3
      users  = ("Longhow Lam");
RUN;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black-box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High-performance scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.


USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5
R² neural net = 0.6



IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 81: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 82: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

bull Nodes represent random variables bull Links between nodes represent conditional dependenciesbull Conditional probabilty tables are derived from training data for each node

bull Random variables are typically binary or discrete

bull The graph structure can be learned from the data

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 83: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rs.IENS;
   /* Add a recommendation system */
   ADD rs.IENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

   /* Method SVD L-BFGS with 20 factors */
   METHOD svd /
      factors = 20
      label = "svd"
      fconv = 1e-3
      gconv = 1e-3
      maxiter = 100
      maxfeval = 5000
      function = L2
      lamda = 0.2
      technique = lbfgs;
   RUN;

   METHOD arm / label = "ARM";
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      num = 3
      users = ("Longhow Lam");
RUN;
QUIT;

LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience; (more) difficult to explain
• Black-box approach ("you are rejected: the computer says NO")
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data preparation (manual tuning) necessary (just throw it into the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used the text miner to parse and filter the reviews, and to transform them into data points in SVD space.
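A hedged scikit-learn sketch of this pipeline (a Python stand-in for the SAS Text Miner flow; the reviews, scores and dimensions are toy values):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

reviews = ["lekker eten, snelle bediening",   # toy reviews (the real set had ~16,000)
           "koud eten, nooit meer",
           "prima zaak, vriendelijk personeel"]
scores = [8.0, 3.0, 7.5]                      # hypothetical review grades

model = make_pipeline(
    TfidfVectorizer(),                        # term-document matrix A
    TruncatedSVD(n_components=2),             # SVD projection (the deck used ~300 dims)
    MLPRegressor(hidden_layer_sizes=(5,), max_iter=5000, random_state=0),
)
model.fit(reviews, scores)
print(model.predict(["lekker eten"]))         # score for a 'new review'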


USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

(Scatter plot: predicted review score vs. given review score.)

R² linear regression = 0.5; R² neural net = 0.6

IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 × 28 pixels, i.e. a 784-dimensional vector.

(Image: the first 100 digits of the MNIST data with their KNOWN labels in red.)

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

Using a 70/30 training/validation split, the following models were tried (a sketch of the winning k-NN benchmark follows below):
• PCA: regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors

8-nearest-neighbors has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
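A minimal sketch of the k-NN part of this benchmark, using scikit-learn's small bundled digits set as a stand-in for the 42,000 MNIST images:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)          # 8x8 digits as a small MNIST stand-in
X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.7, random_state=0)
for k in (8, 16, 24):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, round(1 - knn.score(X_va, y_va), 3))   # misclassification rate per k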


MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels; our best model predicted the label for each of them.

(Image: the first 100 predicted digits displayed together with the handwritten digits; the red numbers are the predicted labels. We see some obvious mistakes…)

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

(Embedded audio players: recordings of the spoken digits '1' and '2'.)

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.
• Use spectral analysis to convert the signal to the frequency domain.
• Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
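A hedged Python sketch of this pipeline (the file names are hypothetical, the recordings are assumed mono and at least n samples long; scikit-learn stands in for the Enterprise Miner flow, with the same 9-neuron hidden layer):

import numpy as np
from scipy.io import wavfile
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

files = [f"one_{i}.wav" for i in range(8)] + [f"two_{i}.wav" for i in range(8)]
labels = [1] * 8 + [2] * 8

def spectrum(path, n=4096):
    _, signal = wavfile.read(path)                 # assumes mono, >= n samples
    return np.abs(np.fft.rfft(signal[:n]))         # magnitude spectrum

X = np.array([spectrum(f) for f in files])
X_pc = PCA(n_components=5).fit_transform(X)        # keep a few principal components
clf = MLPClassifier(hidden_layer_sizes=(9,),       # 9 neurons, one hidden layer
                    max_iter=2000, random_state=0).fit(X_pc, labels)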


SPEECH RECOGNITION

(Image slide.)

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.
• Zero errors on the training data.
• Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to:
• retrieve the SAS faces from our site,
• put them through the Face++ API,
• collect the JSON results and store them in an analytical base table (ABT).

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt look-alike? Nearest neighbor
• Strange faces? proc neural (auto-encoder)
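As an illustration of the auto-encoder step, a hedged Python sketch (a scikit-learn stand-in for proc neural; the landmark table is simulated): faces whose landmark vectors reconstruct poorly are flagged as 'strange'.

import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.random.rand(50, 166)                        # simulated ABT: 83 landmarks x (x, y)
ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
ae.fit(X, X)                                       # train the net to reproduce its input
errors = ((ae.predict(X) - X) ** 2).mean(axis=1)   # reconstruction error per face
print(np.argsort(errors)[-5:])                     # indices of the 5 'strangest' faces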


STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-ALIKES

STRANGE FACE DETECTION: STRANGE FACES

(Images: SAS faces vs. actors' faces.)

Read more on my blog.

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

(Images: SAS faces vs. actors' faces.)

Read more on my blog.

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 84: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 85: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

ldquoAdvancedrdquo word counting

Parse amp Filter Part of speech Entity detection Mixed numeric abbrev Stemming Spell checks Stop list Synonim list Multi-term words

Apply Traditional data mining Clustering Prediction machine learning

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 86: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING BASICS

Document 1 ldquoIk loop over straat in Amsterdam 1057DK met mijn fietsrdquoDocument 2 ldquoZij liep niet maar fietste met haar blauwe fieets bitlycomsdrtwrdquoDocument 3 ldquoMijn tweewieler is kapot wat een slecht stuk ijzer $$rdquo

Terms Doc 1 Doc 2 Doc 3+Fiets (znmw) 1 1 1Fietsen (ww) 0 1 0Blauwe (bvg) 0 1 0Amsterdam (locatie) 1 0 0+Lopen (ww) 1 1 0Straat (znmw) 1 0 0Kapot (bijw) 0 0 1Slecht 0 0 1Stuk Ijzer 0 0 11057DK (postcode) 1 0 0bitlycomsdrtw (Internet) 0 1 0

TERM DOCUMENT MATRIX Abull Each text document is (very) long vector

of word counts (often with many zeros)

bull Apply further mining on this matrix A

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 87: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document matrix

bull Often more terms than documents

bull Rows could be strongly correlated

bull Matrix is often very sparse

Apply Singular value decomposition first

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

RECOMMENDATION ENGINE: ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms: Slope one (slope1), K nearest neighbors (knn)
Model-based algorithms: Matrix factorization (SVD - LBFGS)
Market basket analysis: Association rules mining (arm)
Mixture of different methods: Clustering (cluster), Ensemble

RE METHODS: SLOPE ONE

Item-item based: y = x + b, with the slope equal to 1.

Weight w_ij: the number of users having rated both items i and j.
Rating r_uj: the average rating computed from item j.

Sample rating database (a worked example follows below):

Customer   Item A   Item B   Item C
John         5        3        2
Mark         3        4        -
Lucy         -        2        5
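As a worked illustration of weighted slope one on this table: the average difference between items A and B is ((5 − 3) + (3 − 4)) / 2 = 0.5 (John and Mark rated both), and between A and C it is (5 − 2) / 1 = 3 (only John rated both). Lucy's predicted rating for item A is then the weighted average ((2 + 0.5) · 2 + (5 + 3) · 1) / (2 + 1) = 13/3 ≈ 4.33.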

RE METHODS: K NEAREST NEIGHBORS

The rating r_ui is determined by the ratings "in the neighborhood" N, weighted by similarity w.

How to determine the neighbors, and how many (k) to use?

How to compute the similarity/distance measure?
• Pearson's correlation coefficient
• Cosine distance
• Other adjustments

RE METHODS: PEARSON CORRELATION

a, b: users; r_{a,p}: rating of user a for item p; P: set of items rated both by a and b. Possible similarity values between −1 and 1.

sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )
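A minimal SAS/IML sketch of this similarity (made-up rating vectors; '.' marks items a user has not rated):

proc iml;
   /* Pearson similarity of users a and b over the items both rated */
   start pearson_sim(ra, rb);
      P = loc(ra ^= . & rb ^= .);      /* items rated by both users */
      a = ra[, P];   a = a - a[:];     /* center on the common-item means */
      b = rb[, P];   b = b - b[:];
      return( sum(a # b) / sqrt(ssq(a) * ssq(b)) );
   finish;

   ra  = {5 3 . 1 4};                  /* made-up ratings of user a */
   rb  = {4 3 1 . 5};                  /* made-up ratings of user b */
   sim = pearson_sim(ra, rb);
   print sim;                          /* 0.5 for these two vectors */
quit;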


RE METHODS K NEAREST NEIGHBORS METHOD

RE METHODS: MATRIX FACTORIZATION

How do we fill in the missing data? Factorize the user-item matrix:

R (m × n, users × items) ≈ U (m × k) · V (k × n)

• Select a loss function (squared error)
• Select the number of hidden factors k
• Optimization problem: L-BFGS, ALS

Predict a new rating: R̂_ij = U_iᵀ V_j

Minimize the prediction error:

min_{U,V} Σ_{i,j} (R_ij − U_iᵀ V_j)² + λ (‖U_i‖² + ‖V_j‖²)
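To make the idea concrete, a small SAS/IML sketch of alternating least squares (ALS) on a made-up 5 × 4 rating matrix ('.' = unknown), with k = 2 hidden factors (the PROC RECOMMEND call used in practice is shown later):

proc iml;
   call randseed(123);
   R = {5 3 . 1,
        4 . . 1,
        1 1 . 5,
        1 . . 4,
        . 1 5 4};                     /* made-up user-item ratings */
   m = nrow(R);   n = ncol(R);   k = 2;
   lambda = 0.1;
   U = j(m, k);   call randgen(U, "Uniform");
   V = j(k, n);   call randgen(V, "Uniform");
   do iter = 1 to 50;
      do ii = 1 to m;                 /* fix V, ridge-solve each user row */
         obs = loc(R[ii, ] ^= .);
         Vo  = V[, obs];
         U[ii, ] = t( inv(Vo * t(Vo) + lambda * I(k)) * Vo * t(R[ii, obs]) );
      end;
      do jj = 1 to n;                 /* fix U, ridge-solve each item column */
         obs = loc(R[, jj] ^= .);
         Uo  = U[obs, ];
         V[, jj] = inv(t(Uo) * Uo + lambda * I(k)) * t(Uo) * R[obs, jj];
      end;
   end;
   Rhat = U * V;                      /* filled-in rating matrix */
   print Rhat;
quit;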

RE METHODS: CLUSTER

First cluster the users/items on their profiles and ratings; then apply k-NN within one subgroup to generate the predictions.

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rules mining: identify frequent itemsets (rules) in the transaction data, e.g.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X → Y) = (# transactions with both X and Y) / (total # transactions)

Lift(X → Y) = Support(X → Y) / ( Support(X) · Support(Y) )

Support & lift example: Diapers → Beer 0.8, Diapers → Candles 0.018.

For example, a lift of 2.5 means: if people have X, they are 2.5 times more likely to buy Y than if they don't have X.
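A quick worked example with made-up numbers: out of 1,000 transactions, 300 contain diapers, 200 contain beer, and 100 contain both. Then Support(diapers → beer) = 100 / 1000 = 0.10 and Lift = 0.10 / (0.30 · 0.20) ≈ 1.67, so customers with diapers are about 1.7 times more likely to also have beer than average.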

METHOD: ENSEMBLE

Linear combination of the previous methods, to achieve better performance.

PROC RECOMMEND recom = rsIENS;

   /* Add a recommendation system */
   ADD rsIENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rsIENS type = rating vars = (item user rating);

   /* Method SVD LBFGS with 20 factors */
   METHOD svd
      factors = 20 label = svd fconv = 1e-3 gconv = 1e-3
      maxiter = 100 MAXFEVAL = 5000 function = L2
      lamda = 0.2 technique = lbfgs;
   RUN;

   METHOD ARM label = ARM;
   RUN;

   /* information on the recommender system */
   INFO;
QUIT;

/* prediction with the SVD method */
PROC RECOMMEND recom = rsIENS;
   PREDICT / method = svd label = svd Num = 3
             users = ("Longhow Lam");
RUN;
QUIT;


LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS
• Unfamiliar with a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS
• Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem

WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable

SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces…

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter the reviews, and transform the reviews to data points in SVD space.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5
R² neural net = 0.6


IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split; models tried:
• PCA regression on the 50 largest PCs
• Seven single layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors (see the sketch below)

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
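One way to run such a k-NN model in SAS is PROC DISCRIM with a nonparametric method; a sketch with hypothetical dataset and variable names:

proc discrim data = mnist_train test = mnist_valid testout = scored
             method = npar k = 8;    /* 8-nearest-neighbour classification */
   class label;                      /* the known digit, 0-9 */
   var pix1-pix784;                  /* the 784 pixel intensities */
run;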

MNIST DATA: APPLY MODEL ON THE TEST SET

28,000 digits without known labels; our best model predicted the label for these digits.

The first 100 predicted digits together with the handwritten digits are displayed here. Red numbers are predicted labels; we see some obvious mistakes…

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[embedded audio players: spoken digits "1" and "2"]

SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in wav files, 8 spoken 'twos' in wav files.
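A rough sketch of the two reduction steps in SAS (hypothetical dataset and variable names; each wav file is assumed to be loaded as a series of amplitude points):

/* 1) convert one wav signal to the frequency domain (periodogram) */
proc spectra data = wav_one1 out = freq_one1 p;
   var amplitude;
run;

/* 2) after stacking the periodograms as one row per wav file
      (e.g. with proc transpose), reduce them to a few principal components */
proc princomp data = spectra_wide out = spectra_pcs n = 10;
   var freq1-freq500;                /* hypothetical frequency columns */
run;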


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, and collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical cluster)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-ALIKES

STRANGE FACE DETECTION: STRANGE FACES

SAS faces vs. actors' faces

Read more on my blog

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS faces vs. actors' faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 88: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

A document d is not a long vector of m word counts but a much shorter vector say of length 300

Matrix SVD decompositie

Diagonal with r singular values [ could be many thousands ]

UAVT

Σ

take only the first k ltlt r singular values

Uk

Ak

VTk

Σk

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 89: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn fraud)

Apply machine learning to create a model f to predict the target

Automatically generate topics within large document collectionsApply clustering techniques to classify documents into clusters (topics)

Topic 1 Topic 2 Topic 3

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 90: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE Which product should I recommend my customers

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 91: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE USER ndash ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly Matrix is often very sparse 1 mln users 100K items ~ 001

User - Item Matrix ndash Data Item 1 Item 2 Item 3 Item 4 Item 5

User 1 3 2 5 4 5 User 2 - - - 1 1 User 3 1 - 2 5 - User 4 - - 1 2 5 User 5 2 1 4 2 3 User 6 2 3 - 5 1 User 7 5 1 - 3 4 User 8 - 1 - 4 1 User 9 2 3 2 4 2 User 10 - 1 3 - 1

User 4s Item RatingsUser 4 - - 1 2 5

After some mathhellip recommendations are User 4 321 482 1 2 5

Recommend item 2

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rs.IENS;

  /* Add a recommendation system */
  ADD rs.IENS item = item user = user rating = rating;

  /* Add tables */
  ADDTABLE LHL1209.IENS_UIR recom = rs.IENS type = rating vars = (item user rating);

  /* Method SVD LBFGS with 20 factors */
  METHOD svd
    factors = 20 label = svd fconv = 1e-3 gconv = 1e-3
    maxiter = 100 MAXFEVAL = 5000 function = L2
    lamda = 0.2 technique = lbfgs;
  RUN;

  METHOD ARM label = ARM;
  RUN;

  /* information on the recommender system */
  INFO;
QUIT;


/* prediction with the SVD method */
PROC RECOMMEND recom = rs.IENS;
  PREDICT
    method = svd
    label = svd
    Num = 3
    users = ("Longhow Lam");
RUN;
QUIT;

LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:
• Unfamiliar to a broader audience, (more) difficult to explain
• Black box approach (you are rejected: the computer says NO)
• Often the relations can already be modeled with classical regression models
• It allows you to not think about the business problem

PROS:
• Often less data prep (manual tuning) necessary (just throw it into the algorithm...)
• Interactions are often "automatically" taken into account
• Superior for text mining, image & speech recognition
• Better lift possible (a few percent "for free")
• It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy to use GUIs combined with flexible coding
• High performance scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So, can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used Text Miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
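The slides use SAS Text Miner; as a rough open-source analogue of the same parse/filter/SVD pipeline (the review texts are invented):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    reviews = ["great food and friendly service", "slow service, never again",
               "decent place, fair prices", "too expensive for what you get"]
    tdm = TfidfVectorizer(stop_words="english").fit_transform(reviews)
    X = TruncatedSVD(n_components=3, random_state=0).fit_transform(tdm)
    print(X.shape)   # each review is now a point in SVD space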


Predicted review score vs. given review score.

USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

R² linear regression = 0.5, R² neural net = 0.6.
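As a hedged illustration of why the neural net can reach a higher R² (synthetic data with an interaction and a nonlinearity, not the IENS reviews):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = rng.standard_normal((2000, 10))
    y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + 0.1 * rng.standard_normal(2000)

    print(LinearRegression().fit(X, y).score(X, y))   # low R²: misses the nonlinearity
    mlp = MLPRegressor(hidden_layer_sizes=(50,), max_iter=3000, random_state=0)
    print(mlp.fit(X, y).score(X, y))                  # clearly higher R²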


IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data and their KNOWN labels in red.


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest neighbours has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split. Techniques tried (see the sketch after this list):
• PCA regression on the 50 largest PCs
• Seven single layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi layer neural nets
• Three random forests: 100, 500 and 1000 trees
• 8, 16 and 24 nearest neighbors
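The winning setup can be reproduced along these lines with scikit-learn (the slides used Enterprise Miner, so the exact error rate will differ):

    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.7, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=8).fit(X_tr, y_tr)
    print(1 - knn.score(X_va, y_va))   # misclassification rate on the validation set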


MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels.

Our best model predicted the label for these digits.

The first 100 predicted digits are displayed here together with the handwritten digits. Red numbers are the predicted labels; we see some obvious mistakes...


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE


SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy.

Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.
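A sketch of that feature pipeline in Python (the WAV file names are hypothetical placeholders, and a common frequency range is assumed):

    import numpy as np
    from scipy.io import wavfile
    from sklearn.decomposition import PCA

    files = [f"one_{i}.wav" for i in range(8)] + [f"two_{i}.wav" for i in range(8)]
    spectra = []
    for f in files:
        rate, signal = wavfile.read(f)            # ~30,000 samples per recording
        if signal.ndim > 1:                       # keep one channel if stereo
            signal = signal[:, 0]
        spectrum = np.abs(np.fft.rfft(signal))    # spectral analysis: frequency domain
        spectra.append(spectrum[:5000])           # truncate to a common range
    X = PCA(n_components=10).fit_transform(np.array(spectra))
    print(X.shape)                                # 16 recordings x 10 components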


SPEECH RECOGNITION


SPEECH RECOGNITION

Zero errors on the training data.

Zero errors on the test data (also 8 'ones' and 8 'twos').

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.
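An open-source stand-in for that Enterprise Miner model (synthetic features replace the PCA scores so the snippet runs on its own):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (8, 10)),    # 8 'ones' feature vectors
                   rng.normal(3, 1, (8, 10))])   # 8 'twos' feature vectors
    y = ["one"] * 8 + ["two"] * 8
    net = MLPClassifier(hidden_layer_sizes=(9,), max_iter=5000, random_state=0)
    print(net.fit(X, y).score(X, y))             # 1.0 = zero errors on the training data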


STRANGE FACE DETECTION COMBO OF OPEN API, R & SAS

Little joke on my colleagues...


STRANGE FACE DETECTION COMBO OF OPEN API, R & SAS

Get a free API key for Face++; their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT (a sketch of the auto-encoder idea follows this list):
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder
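The 'strange faces' step can be sketched with an auto-encoder that learns to reconstruct the landmark vectors; faces with a large reconstruction error are flagged (synthetic landmarks here, and scikit-learn instead of proc neural):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(1)
    faces = rng.normal(0, 1, (60, 166))          # 83 landmarks x (x, y) per face
    ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
    ae.fit(faces, faces)                         # learn to reproduce the input
    errors = ((faces - ae.predict(faces)) ** 2).mean(axis=1)
    print(np.argsort(errors)[-3:])               # indices of the three strangest faces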


STRANGE FACE DETECTION LOOK ALIKE FACES


STRANGE FACE DETECTION BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION STRANGE FACES

SAS Faces / Actors Faces

Read more on my blog


STRANGE FACE DETECTION COMBO OF OPEN API, R & SAS

SAS Faces / Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 92: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms Slope one (slope1) K nearest neighbors (knn)

Model-based algorithms Matrix factorization (SVD - LBFGS)

Market basket analysis Association rules mining (arm)

Mixture of different methods Clustering(cluster) Ensemble

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 93: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS SLOPE ONE

Y = x + b with slope equal to 1

See notes

Item-item based

Weight wij the number of users having rated both items i and j Rating ruj the average rating computed from item j

Sample rating databaseCustomer Item A Item B Item C

John 5 3 2

Mark 3 4

Lucy 2 5

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 94: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS

The rating rui is determined by the ratings ldquoin the neighborhoodrdquo

How to determine the neighbors and how many (k) to use

How to compute the similaritydistance measure bull Pearsonrsquos correlation coefficientbull Cosine distancebull Other adjustments

Similarity w

Neighbors N

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 95: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS

PEARSON CORRELATION users

rating of user for item

set of items rated both by and bull Possible similarity values between and

119956119946119950 (119938 119939 )=sum119953isin119927

(119955 119938 119953minus119955 119938)(119955 119939119953minus119955 119939)

radic sum119953isin119927

(119955119938 119953minus119955119938 )120784radic sum119953 isin119927

(119955119939 119953minus119955 119939)120784

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 96: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS K NEAREST NEIGHBORS METHOD

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD: ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rule mining: identify frequent itemsets (rules) in the transaction data.

IF item A and B THEN item C
IF item X THEN item Y

Not all rules are interesting: use 'support' and 'lift' to judge the importance of a rule.

Support(X, Y) = (# transactions with X and Y) / (total # transactions)

Lift = Support(X, Y) / ( Support(X) · Support(Y) )

Support & lift example: Diapers → Beer 0.8; Diapers → Candles 0.018.

For example, a lift of 2.5 means: people who have X are 2.5 times more likely to buy Y than people who don't have X.
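A small pure-Python sketch of the two measures (toy transactions, invented numbers):

transactions = [
    {"diapers", "beer"},
    {"diapers", "beer"},
    {"diapers", "beer", "candles"},
    {"beer"},
    {"candles"},
]

def support(itemset):
    # Fraction of transactions containing every item in the set.
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(x, y):
    return support(x | y) / (support(x) * support(y))

print(support({"diapers", "beer"}))   # support of the rule diapers -> beer: 0.6
print(lift({"diapers"}, {"beer"}))    # 1.25 here; lift > 1 indicates a positive association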


METHOD: ENSEMBLE

A linear combination of the previous methods can achieve better performance.
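A minimal sketch of such a linear combination (illustrative weights and numbers, not from the deck):

import numpy as np

pred_svd = np.array([4.1, 2.8, 3.5])    # predicted ratings from matrix factorization
pred_knn = np.array([3.9, 3.2, 3.1])    # predicted ratings from item k-NN
w = 0.6                                 # blend weight, to be tuned on validation data
pred_ensemble = w * pred_svd + (1 - w) * pred_knn
print(pred_ensemble)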


PROC RECOMMEND recom = rsIENS;

   /* Add a recommendation system */
   ADD rsIENS item = item user = user rating = rating;

   /* Add tables */
   ADDTABLE LHL1209.IENS_UIR recom = rsIENS type = rating vars = (item user rating);

   /* Method SVD: L-BFGS with 20 factors */
   METHOD svd
      factors = 20 label = svd fconv = 1e-3 gconv = 1e-3
      maxiter = 100 maxfeval = 5000 function = L2
      lamda = 0.2 technique = lbfgs;
   RUN;

   METHOD arm label = ARM;
   RUN;

   /* Information on the recommender system */
   INFO;
QUIT;


/* Prediction with the SVD method */
PROC RECOMMEND recom = rsIENS PREDICT
   method = svd label = svd
   Num = 3 users = (Longhow Lam);
run;
QUIT;


LAST SLIDE


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS
Unfamiliar to a broader audience, (more) difficult to explain
Black-box approach ("you are rejected: the computer says NO")
Often the relations can already be modeled with classical regression models
It allows you to not think about the business problem

PROS
Often less data preparation (manual tuning) necessary (just throw it into the algorithm…)
Interactions are often "automatically" taken into account
Superior for text mining, image & speech recognition
Better lift possible (a few percent "for free")
It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

• Many different techniques
• Easy-to-use GUIs combined with flexible coding
• High performance / scalability
• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
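The deck does this with SAS Text Miner; as a rough open-source stand-in (toy Dutch reviews, parameters assumed), the same parse-filter-SVD pipeline looks like:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

reviews = [
    "heerlijk gegeten, vriendelijke bediening",
    "lang wachten en de pizza was koud",
    "top restaurant, zeker een aanrader",
    "slechte service, nooit meer",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(reviews)       # sparse term-document matrix
svd = TruncatedSVD(n_components=2)     # the deck uses ~300 components on 16,000 reviews
X_svd = svd.fit_transform(X)
print(X_svd.shape)                     # (4, 2): each review is now a point in SVD space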


USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:
R² linear regression = 0.5; R² neural net = 0.6
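For intuition only, a synthetic scikit-learn version of this comparison (invented data and settings, so the R² values will differ from the IENS numbers):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 300))                                  # 300 SVD inputs
y = X[:, 0] + 2 * np.sin(X[:, 1]) + 0.5 * rng.standard_normal(2000)  # nonlinear target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
lin = LinearRegression().fit(X_tr, y_tr)
net = MLPRegressor(hidden_layer_sizes=(50,), max_iter=500, random_state=0).fit(X_tr, y_tr)
# Compare out-of-sample R² of the linear model and the neural net.
print(r2_score(y_te, lin.predict(X_te)), r2_score(y_te, net.predict(X_te)))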




IENS REVIEWS: APPLY MODEL ON 'NEW REVIEWS'


MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)


MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data with their KNOWN labels in red.


MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

70/30 training/validation split:
• PCA regression on the 50 largest PCs
• Seven single-layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons
• Seven multi-layer neural nets
• Three random forests: 100, 500 and 1,000 trees
• 8, 16 and 24 nearest neighbors

8-nearest-neighbors has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.
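A runnable stand-in for the winning model, using scikit-learn's built-in 8×8 digits instead of the 28×28 MNIST files so no download is needed:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)            # each digit flattened to a 64-dim vector
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=0.7, random_state=0)
knn = KNeighborsClassifier(n_neighbors=8).fit(X_tr, y_tr)
print(1 - knn.score(X_val, y_val))             # validation misclassification rate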


MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits are displayed here together with the handwritten digits. Red numbers are the predicted labels; we see some obvious mistakes…


SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

(Embedded audio players: spoken '1' and '2' samples.)


SPEECH RECOGNITION

WAV files consist of ~30,000 points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components.

TRAIN DATA: 8 spoken 'ones' in WAV files, 8 spoken 'twos' in WAV files.
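A sketch of that pipeline in Python (the WAV file names are hypothetical, and mono recordings of at least ~8,000 samples are assumed):

import numpy as np
from scipy.io import wavfile
from sklearn.decomposition import PCA

spectra = []
for word in ("one", "two"):
    for i in range(8):
        rate, signal = wavfile.read(f"{word}_{i}.wav")   # hypothetical file names
        spec = np.abs(np.fft.rfft(signal))               # spectral analysis: frequency domain
        spectra.append(spec[:4000])                      # keep a fixed band so rows align
X = np.vstack(spectra)
X_pca = PCA(n_components=5).fit_transform(X)             # a few principal components as features
print(X_pca.shape)                                       # (16, 5)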


SPEECH RECOGNITION


SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer.

Zero errors on the training data. Zero errors on the test data (also 8 'ones' and 8 'twos').


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

A little joke on my colleagues…


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

Get a free API key for Face++. Their API returns 83 facial landmarks (in JSON format).

Create an R script to retrieve the SAS faces from our site, put them through the Face++ API, and collect the JSON results and store them in an ABT.

Apply advanced analytics on the ABT:
• Which faces are look-alikes? proc cluster (hierarchical clustering)
• Sales faces? Predictive modeling / machine learning
• Who is the Brad Pitt? Nearest neighbor
• Strange faces? proc neural auto-encoder
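The 'strange faces' step ranks faces by auto-encoder reconstruction error. A hedged scikit-learn stand-in for the proc neural auto-encoder, on synthetic landmark data (real input would be the Face++ ABT):

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
landmarks = rng.standard_normal((200, 166))    # 83 (x, y) landmarks per face, flattened
ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
ae.fit(landmarks, landmarks)                   # auto-encoder: learn to reproduce the input
err = ((ae.predict(landmarks) - landmarks) ** 2).mean(axis=1)
print(np.argsort(err)[-5:])                    # faces with the largest reconstruction error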


STRANGE FACE DETECTION: LOOK-ALIKE FACES


STRANGE FACE DETECTION: BRAD PITT LOOK-A-LIKES


STRANGE FACE DETECTION: STRANGE FACES

SAS faces and actors' faces. Read more on my blog.


STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

SAS faces and actors' faces. Read more on my blog.

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 97: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data

m n

R U=

V

m k k n

Select loss function (squared error)

Select the number of hidden factors k

Optimization problem L-BFGS ALS

user

s

items

119894119895=119880 119894119879119881 119895Predict New Rating R

Minimize prediction error min119906 119907

sum119894 119895

(119877iquestiquest 119894119895minus119880 119894119879 119881 119895)

2+120582(iquest119880 1198942+119881 119895

2)iquestiquest

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 98: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHODS CLUSTER

Knn within one subgroup

Useritem profile

Useritem rating

Predictions

Clustering

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 99: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for assoc rules mining Identify frequent itemsets (rules) in the transaction data

IF item A and B THEN item C IF item X THEN item Y

Not all rules are interesting use lsquosupportrsquo and lsquoliftrsquo to judge importance of a rule

trxs X Y

Total trxs Support (XY) =

Lift = Support (XY)

Support (X) Support(Y)

Support amp Lift Diapers Beer 08

Diapers Candles 0018

For example a lift of 25 means If people have X they are 25 more likely to buy Y than if they donrsquot have X

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 100: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

  • Machine learning with SAS workshop
  • Agenda
  • Longhow lam
  • Intro
  • Machine learning
  • SAS SOFTWARE
  • The Analytics Lifecycle
  • Easy to use GUI
  • High performance
  • Easy deployable
  • Predict
  • Machine learning (2)
  • Machine learning (3)
  • Machine learning (4)
  • Machine learning (5)
  • Machine learning (6)
  • Overview of specific machine learning methods
  • ldquoClassicalrdquo regression
  • linear amp Logistic
  • Spline regression
  • Spline regression (2)
  • Spline regression (3)
  • Spline regression (4)
  • Slide 24
  • Spline regression (5)
  • Decision trees
  • Decision Trees
  • Decision trees (2)
  • Decision trees (3)
  • Decision trees (4)
  • Decision trees (regressie amp classificatie)
  • Decision trees (5)
  • Decision trees pros and cons
  • Dimension reduction
  • Principle Components
  • Principle Components (2)
  • Principle Components (3)
  • Principle Components (4)
  • Principle Components (5)
  • Principle components
  • Singular value
  • Singular value (2)
  • SVD example
  • SVD example (2)
  • SVD example (3)
  • Variable Clustering
  • Variable Clustering (2)
  • Variable Clustering (3)
  • Bagging amp Boosting
  • Combine models
  • Bagging amp Boosting Random Forests
  • Forest vs tree
  • FOREST vs TREE
  • Gradient boosting
  • Gradient boosting (2)
  • Support vector machines
  • Support vector machines (SVM)
  • Support vector machines (SVM) (2)
  • Support vector machines (SVM) (3)
  • SVM
  • Slide 61
  • K ndash nearest neighbour
  • k-NN
  • K-NN
  • K-nn
  • K-NN example
  • Slide 67
  • K-NN example (2)
  • Slide 69
  • Neural networks
  • Neural network
  • Neural networks (2)
  • Neural networks (3)
  • Deep learning
  • Neural nets
  • Neural nets (2)
  • Neural net
  • Neural nets (3)
  • Neural nets (4)
  • Slide 80
  • Bayesian networks
  • Bayesian
  • Slide 83
  • Text mining
  • Text mining
  • Text mining (2)
  • Text mining (2)
  • Text mining (3)
  • Text mining (3)
  • Recommendation engine
  • Recommendation engine
  • Recommendation engine (2)
  • RE methods
  • RE methods (2)
  • RE METHODS
  • RE Methods
  • RE Methods (2)
  • RE Methods (3)
  • RE Method
  • Method
  • Slide 101
  • Slide 102
  • Last slide
  • Pros and cons
  • Why SAS
  • Some machine learning examples
  • Predicting sentiment from restaurant reviews
  • Iens reviews
  • Use machine
  • Use machine (2)
  • Iens reviews (2)
  • MNIST Data in sas
  • MNIST
  • MNIST data
  • MNIST data (2)
  • Speech recognition
  • speech
  • Speech
  • speech
  • Strange Face detection
  • Strange Face detection (2)
  • Strange Face detection (3)
  • Strange Face detection (4)
  • Strange Face detection (5)
  • Strange Face detection (6)
Page 101: Machine learning overview (with SAS software)

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PROC RECOMMEND recom = rsIENS

Add a recommendation system ADD rsIENS item = item user = user rating = rating

Add tables ADDTABLE LHL1209IENS_UIR recom = rsIENS type = rating vars=(item user rating)

Method SVD LBFGS met 20 factoren METHOD svd

factors = 20 label = svd fconv = 1e-3 gconv = 1e-3 maxiter = 100 MAXFEVAL = 5000 function = L2 lamda = 02 technique = lbfgs

RUN

METHOD ARM label = ARM

RUN

information on the recommender system INFOQUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

prediction with the SVD method

PROC RECOMMEND recom = rsIENS PREDICT

method = svdlabel = svdNum = 3users = (Longhow Lam)

run

QUIT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

LAST SLIDE

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

OF MORE MODERN MACHINE LEARNING

CONS Unfamilar with broader audiance (more) difficult to explain

Black box approach (you are rejected The computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS Often less data prep (manual tuning) neccesary (just throw it in the algorithmhellip)

Interactions often ldquoautomaticallyrdquo taken into account

Superior for Text mining Image amp Speech recognition

Better lift possible (paar procent ldquogratisrdquo) It allows you to not think about the business problem

(compared to traditional linear logistic regression)PROS AND CONS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

WHY SAS FOR MACHINE LEARNING

bull Many different techniquesbull Easy to use GUIrsquos combined with flexible codingbull High performance scalabilitybull Easy Deployable

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SOME MACHINE LEARNING EXAMPLES

Text mining Image recognition Sound recognition Strange faces

So can a machine read see and hear

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS COLLECTED AROUND 16000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews

and transform reviews to data points in SVD space

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

IENS REVIEWS APPLY MODEL ON lsquoNEW REVIEWSrsquo

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA IN SASMODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST TRAINING DATA

42000 pictures of hand-written digits Each digit is a picture of 28 by 28 pixels So a 784 dimensional vector

First 100 digits of the MNIST data and there KNOWN labels in red

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 ndash Nearest Neighbour has the lowest misclassification rate 36 of the digits in the validation set are mis classified

7030 trainingvalidation split

PCA regression on 50 largest PCrsquos

Seven singel layer neural nets 3 6 12 24 48 100 200 neurons

Seven multi layer neural nets

Three Random forest 100 500 and 1000 trees

8 16 and 24 nearest neighbors

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

MNIST DATA APPLY MODEL ON TEST SET

28000 digits without known labels

Our best model predicted the label for these digits

First 100 predicted digits together with the handwritten digits are displayed here

Red numbers are predicted labels We see obvious some mistakeshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITIONDIGITS RECORDED WITH IPHONE

1 2

>
>
>
>
>
>
>
>
>
>

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

WAV files consists of ~ 30000 points too much redundancy

Use spectral analysis to convert signal to frequency domain

Still too much apply principle components

TRAIN DATA

8 spoken lsquoonesrsquo in wav files 8 spoken lsquotwosrsquo in wav files

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data Also 8 lsquoonesrsquo and 8 lsquotwosrsquo

In Enterprise MinerNeural network with 9 neurons in one hidden layer

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION LOOK ALIKE FACES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION BRAD PIT LOOK A LIKES

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION STRANGE FACES

SAS Faces Actors Faces

Read more on my blog

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

SAS Faces Actors Faces

Read more on my blog

Page 102: Machine learning overview (with SAS software)


Prediction with the SVD method:

PROC RECOMMEND recom = rs.IENS;
   /* top-3 predictions for the user "Longhow Lam" with the SVD model */
   PREDICT / method = svd label = "svd" num = 3 users = ("Longhow Lam");
RUN;
QUIT;

Page 103: Machine learning overview (with SAS software)


LAST SLIDE

Page 104: Machine learning overview (with SAS software)


PROS AND CONS OF MORE MODERN MACHINE LEARNING
(compared to traditional linear / logistic regression)

CONS:

  • Unfamiliar to a broader audience, (more) difficult to explain
  • Black-box approach (you are rejected: the computer says NO)
  • Often the relations can already be modeled with classical regression models
  • It allows you to not think about the business problem

PROS:

  • Often less data prep (manual tuning) necessary (just throw it in the algorithm…)
  • Interactions are often "automatically" taken into account
  • Superior for text mining, image & speech recognition
  • Better lift possible (a few percent "for free")

Page 105: Machine learning overview (with SAS software)


WHY SAS FOR MACHINE LEARNING

  • Many different techniques
  • Easy-to-use GUIs combined with flexible coding
  • High performance / scalability
  • Easily deployable

Page 106: Machine learning overview (with SAS software)


SOME MACHINE LEARNING EXAMPLES

Text mining, image recognition, sound recognition, strange faces.

So: can a machine read, see and hear?

Page 107: Machine learning overview (with SAS software)


PREDICTING SENTIMENT FROM RESTAURANT REVIEWS

Page 108: Machine learning overview (with SAS software)


IENS REVIEWS: COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter the reviews, and to transform the reviews to data points in SVD space.
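The slides do not include the text-mining code; a sketch of how the parsing plus the SVD projection might look with PROC HPTMINE (the dataset, variable names and option values are assumptions, not from the deck):

proc hptmine data=iens_reviews;
   doc_id review_id;                 /* one document per review             */
   var review_text;
   parse outterms=terms reducef=2;   /* parse and filter low-frequency terms */
   svd k=300 outdocpro=doc_svd;      /* each review -> 300 SVD coordinates  */
run;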

Page 109: Machine learning overview (with SAS software)


USE MACHINE LEARNING TO PREDICT THE TARGET WITH THE 300 INPUTS

Predicted review score vs. given review score:

R² linear regression = 0.5; R² neural net = 0.6
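With the 300 SVD coordinates as inputs, the linear benchmark is a one-liner; a sketch, assuming a scored table doc_svd_scored with the review score and the SVD columns col1-col300 (names are assumptions). The neural-net result would come from a separate model, e.g. a PROC NEURAL run as sketched earlier:

proc reg data=doc_svd_scored;
   model score = col1-col300;   /* review score regressed on the 300 SVD inputs */
run;
quit;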

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Predicted review score vs Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R2 Linear regression = 05R2 Neural Net = 06

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

Page 110: Machine learning overview (with SAS software)

Predicted review score vs. given review score (R² linear regression = 0.5; R² neural net = 0.6)

Page 111: Machine learning overview (with SAS software)

IENS REVIEWS: APPLY MODEL ON ‘NEW REVIEWS’

Page 112: Machine learning overview (with SAS software)

MNIST DATA IN SAS (MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY)

Page 113: Machine learning overview (with SAS software)

MNIST TRAINING DATA

42,000 pictures of hand-written digits. Each digit is a picture of 28 by 28 pixels, so a 784-dimensional vector.

First 100 digits of the MNIST data with their KNOWN labels in red.
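A minimal sketch of getting these data into SAS, assuming the 42,000 digits are available as a CSV file with one label column and 784 pixel columns (pixel0-pixel783), as in the well-known CSV export of MNIST; the file path is hypothetical:

proc import datafile="/data/mnist_train.csv"
            out=work.mnist_train dbms=csv replace;
  getnames=yes;                       /* reads label, pixel0-pixel783 */
run;

proc freq data=work.mnist_train;      /* sanity check: label frequencies */
  tables label;
run;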

Page 114: Machine learning overview (with SAS software)

MNIST DATA: TRYING DIFFERENT LEARNING TECHNIQUES

8-nearest-neighbour has the lowest misclassification rate: 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split. Techniques tried (a SAS sketch of the split and one of the forests follows below):

  • PCA regression on the 50 largest PCs
  • Seven single-layer neural nets (3, 6, 12, 24, 48, 100, and 200 neurons)
  • Seven multi-layer neural nets
  • Three random forests (100, 500, and 1,000 trees)
  • 8, 16, and 24 nearest neighbours
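A minimal sketch of the 70/30 split and one of the random forests (not the author's exact flow); dataset and variable names carry over from the import sketch above:

proc surveyselect data=work.mnist_train out=work.mnist_split
                  samprate=0.7 outall method=srs seed=12345;
run;                                  /* OUTALL adds a Selected flag */

data work.train work.valid;
  set work.mnist_split;
  if Selected then output work.train; /* 70% training */
  else output work.valid;             /* 30% validation */
run;

proc hpforest data=work.train maxtrees=500;
  target label / level=nominal;
  input pixel0-pixel783 / level=interval;
run;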

Page 115: Machine learning overview (with SAS software)

MNIST DATA: APPLY MODEL ON TEST SET

28,000 digits without known labels. Our best model predicted the label for these digits.

The first 100 predicted digits are displayed here together with the handwritten digits. Red numbers are the predicted labels; we see some obvious mistakes…
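A minimal sketch of scoring the unlabelled digits, assuming the winning 8-nearest-neighbour model is rebuilt with PROC DISCRIM's nonparametric method; work.mnist_test is a hypothetical dataset holding the 28,000 test digits:

proc discrim data=work.train method=npar k=8
             test=work.mnist_test testout=work.test_scored;
  class label;
  var pixel0-pixel783;
run;                                  /* _INTO_ holds the predicted label */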

Page 116: Machine learning overview (with SAS software)

SPEECH RECOGNITION: DIGITS RECORDED WITH IPHONE

[Embedded audio clips: recordings of the spoken digits “1” and “2”]

Page 117: Machine learning overview (with SAS software)

SPEECH RECOGNITION

WAV files consist of ~30,000 sample points: too much redundancy. Use spectral analysis to convert the signal to the frequency domain. Still too much: apply principal components. (A sketch of both steps follows below.)

TRAIN DATA: 8 spoken ‘ones’ and 8 spoken ‘twos’, as WAV files.
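A minimal sketch of the preprocessing idea, assuming each recording has been read into a SAS dataset with one amplitude column, and that the 16 periodograms are then stacked as rows with hypothetical frequency-bin columns f1-f2048:

proc spectra data=work.wav_one1 out=work.spec_one1 p;
  var amplitude;                      /* P_01 = periodogram of the signal */
run;

proc princomp data=work.spectra_all out=work.spectra_pcs n=10;
  var f1-f2048;                       /* keep the first 10 principal components */
run;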

Page 118: Machine learning overview (with SAS software)

SPEECH RECOGNITION

Page 119: Machine learning overview (with SAS software)

SPEECH RECOGNITION

In Enterprise Miner: a neural network with 9 neurons in one hidden layer. Zero errors on the training data; zero errors on the test data (also 8 ‘ones’ and 8 ‘twos’).
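A minimal sketch of a comparable network outside the Enterprise Miner GUI, using PROC HPNEURAL on the principal-component scores; the dataset and variable names (work.spectra_pcs, prin1-prin10, digit) are assumptions:

proc hpneural data=work.spectra_pcs;
  input prin1-prin10 / level=int;     /* PC scores from the spectra */
  target digit / level=nom;           /* '1' vs '2' */
  hidden 9;                           /* one hidden layer, 9 neurons */
  train;
run;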

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Little joke on my colleagueshellip

Copyr igh t copy 2012 SAS Ins t i t u te I nc A l l r i gh ts res erved

STRANGE FACE DETECTION COMBO OF OPEN API R amp SAS

Get free API key for Face++ Their API returns 83 facial landmarks (in JSON format)

Apply advanced analytics on the ABT

Which faces are look-alikes proc cluster (hierarchical cluster)

Sales faces Predictive modeling machine learning

Who is the Brad Pit Nearest Neighbor

Strange faces proc neural auto-encoder

Create R script to Retrieve the SAS faces from our site put them trough the Face++ API Collect JSON results and store them in an ABT

STRANGE FACE DETECTION: LOOK-ALIKE FACES

STRANGE FACE DETECTION: BRAD PITT LOOK-ALIKES

STRANGE FACE DETECTION: STRANGE FACES

(Image panels: SAS faces vs. actors' faces)

Read more on my blog

STRANGE FACE DETECTION: COMBO OF OPEN API, R & SAS

(Image panels: SAS faces vs. actors' faces)

Read more on my blog
