USING SAS HIGH PERFORMANCE STATISTICS FOR PREDICTIVE MODELLING Regan LU, CFA, FRM SAS certified Statistical Business Analyst & SAS certified Advanced Programmer Future of Work Taskforce Department of Jobs and Small Business

Post on 03-Jul-2020

TRANSCRIPT

Page 1: USING SAS HIGH PERFORMANCE STATISTICS FOR PREDICTIVE … · USING SAS HIGH-PERFORMANCE PACKAGE TO BUILD A PREDICTIVE MODEL FROM SCRATCH USING DIFFERENT MACHINE LEARNING ALGORITHMS

USING SAS HIGH PERFORMANCE STATISTICS FOR PREDICTIVE MODELLING

Regan LU, CFA, FRM

SAS certified Statistical Business Analyst & SAS certified Advanced Programmer

Future of Work Taskforce

Department of Jobs and Small Business

Page 2

Part 1: Using Predictive Modelling in the Public Sector

Part 2: Building Predictive Models using SAS

Part 3: Advantages of SAS High Performance Statistics

Part 4: Examples and Comparison

Page 3

PART 1: PREDICTIVE MODELLING IN THE PUBLIC SECTOR

Use longitudinal administrative data to build predictive models

Predictive models can be applied to resolve policy problems

For instance, we could use a predictive model to estimate the labour force participation of a target group in the 2016-17 financial year.

Page 4

PREDICTIVE MODELS COULD BE USED FOR EVIDENCE-BASED POLICY DEVELOPMENT

1. Generalised Linear Model (GLM)

Binomial

Multinomial

Gamma Distribution

2. Decision Trees

Classification Tree

Regression Tree

Page 5

BUILD PREDICTIVE MODELS USING DIFFERENT MACHINE LEARNING ALGORITHMS FOR EVIDENCE-BASED POLICY DEVELOPMENT

Supervised Learning

• Generalised Linear Model (Forward, Backward, Stepwise)

• Decision Trees

• Random Forest

• Neural Network

• Support Vector Machine, etc.

Unsupervised Learning

• K-means Clustering

• Cosine Similarity

Page 6

MODEL VALIDATION 1: RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE

Using longitudinal data, a model was built to predict labour force participation rate of the target group next year.

The closer the curve follows the left-hand border and the top border of the ROC space, the more accurate the predictive model is.
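As an illustration of how such a curve can be produced, here is a minimal sketch. It assumes the SASHELP.HEART sample table as a stand-in, because the presenter's administrative data is not available; the response and predictors are illustrative only.

```sas
/* Stand-in data: SASHELP.HEART ships with SAS and is NOT the presenter's data. */
ods graphics on;

proc logistic data=sashelp.heart plots(only)=roc;
   /* Model the probability that Status = 'Dead' (illustrative response) */
   model Status(event='Dead') = AgeAtStart Systolic Cholesterol;
run;

ods graphics off;
```

The ROC plot title reports the area under the curve, which summarises in one number how closely the curve hugs the top-left corner.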

Page 7

MODEL VALIDATION 2: PARTIAL DEPENDENCY PLOT

Using longitudinal data, a model was built to predict labour force participation rate of the target group next year.

The following chart indicates the accuracy of the prediction between different target groups.

[Chart: groups labelled A, B, C and D]

Page 8

PART 2: BUILDING PREDICTIVE MODELS USING SAS

Regression

$Y = \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_L X_L + \varepsilon$

SAS Statistical Package:

• PROC LOGISTIC

• PROC GENMOD

SAS High-Performance Statistics:

• PROC HPLOGISTIC

• PROC HPGENSELECT

Page 9

SYNTAX: LOGISTIC PROCEDURE

PROC LOGISTIC DATA=DATASET <OPTIONS>;

MODEL RESPONSE=PREDICTOR /<OPTIONS>;

OUTPUT OUT=SAS-DATASET;

RUN;

SAS High-Performance Statistics procedures have similar syntax.

Page 10

PART 3: ADVANTAGES OF SAS HIGH PERFORMANCE STATISTICS

SAS high-performance statistics take advantage of parallel processing. This is crucial for big data analytics.

PROC HPLOGISTIC DATA=DATASET <OPTIONS>;

PERFORMANCE CPUCOUNT=24 NTHREADS=24;

MODEL RESPONSE=PREDICTOR /<OPTIONS>;

OUTPUT OUT=SAS-DATASET;

RUN;
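A filled-in version of the template above might look like the following sketch. The SASHELP.HEART sample table, the chosen variables and the thread count are assumptions made for illustration; they are not the presenter's data or settings.

```sas
/* Stand-in data: SASHELP.HEART, not the presenter's administrative data. */
proc hplogistic data=sashelp.heart;
   class Sex;
   model Status(event='Dead') = Sex AgeAtStart Systolic Cholesterol;
   /* NTHREADS= sets the number of threads; DETAILS requests a timing table */
   performance nthreads=4 details;
run;
```

Rerunning the step with a different NTHREADS= value and comparing the timing output is one simple way to see how much the procedure gains from parallel processing on a given server.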

Page 11

ADVANTAGES OF SAS HIGH PERFORMANCE STATISTICS

When building predictive models with machine learning algorithms such as GLM, we found that high-performance statistics significantly improved the efficiency of model building.

However, the high-performance procedures do not always outperform the traditional SAS procedures (e.g. HPSUMMARY for descriptive statistics).

Page 12

PART 4: EXAMPLES AND COMPARISONS

Build a model using GLM to predict the labour force attachment of a target group.

SAS data: 638k observations with 121 variables

Page 13

COMPARISON OF TIME SPENT ON RUNNING ONE REGRESSION

During the lunch break

| | PROC LOGISTIC (Non-HP statistics) | PROC HPLOGISTIC (HP statistics) | PROC HPLOGISTIC (HP statistics) |
|---|---|---|---|
| Number of Threads | (default: 1) | 4 | 24 |
| Real Time (min:sec) | 28:50.45 | 1:33.74 | 1:32.78 |
| CPU Time (min:sec) | 28:50.68 | 16:33.03 | 16:20.17 |

Between 2 p.m. and 3 p.m.

| | PROC LOGISTIC (Non-HP statistics) | PROC HPLOGISTIC (HP statistics) | PROC HPLOGISTIC (HP statistics) |
|---|---|---|---|
| Number of Threads | (default: 1) | 4 | 24 |
| Real Time (min:sec) | 28:15.13 | 4:24.48 | 1:42.51 |
| CPU Time (min:sec) | 28:14.54 | 16:22.67 | 16:44.18 |
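To read these timings as speed-ups, the real times can be converted to seconds and divided into the PROC LOGISTIC baseline for the same time slot. The DATA step below is a small sketch of that arithmetic; the dataset and variable names are hypothetical, and the figures are copied from the comparison above.

```sas
/* Timings copied from the comparison above; all names are illustrative. */
data timings;
   input period :$12. procname :$12. threads real_mm real_ss cpu_mm cpu_ss;
   real_secs = real_mm*60 + real_ss;   /* elapsed (real) time in seconds */
   cpu_secs  = cpu_mm*60 + cpu_ss;     /* CPU time in seconds */
   datalines;
lunch LOGISTIC 1 28 50.45 28 50.68
lunch HPLOGISTIC 4 1 33.74 16 33.03
lunch HPLOGISTIC 24 1 32.78 16 20.17
afternoon LOGISTIC 1 28 15.13 28 14.54
afternoon HPLOGISTIC 4 4 24.48 16 22.67
afternoon HPLOGISTIC 24 1 42.51 16 44.18
;
run;

data speedup;
   set timings;
   by period notsorted;
   retain base_real;
   /* The PROC LOGISTIC row comes first in each period and is the baseline */
   if first.period then base_real = real_secs;
   speedup = base_real / real_secs;
run;

proc print data=speedup noobs;
   var period procname threads real_secs cpu_secs speedup;
   format speedup 6.2;
run;
```

For the lunch-break runs this works out to roughly an 18-fold reduction in real time. CPU time falls far less, which is consistent with the gain coming mainly from spreading the work across threads.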

Page 14

USING SAS HIGH-PERFORMANCE PACKAGE TO BUILD A PREDICTIVE MODEL FROM SCRATCH USING DIFFERENT MACHINE LEARNING ALGORITHMS

| Machine Learning Algorithm | Procedure | 1 Core (High Performance Statistical Package) | 24 Cores (High Performance Statistical Package) |
|---|---|---|---|
| Generalized Linear Model (Stepwise) | PROC HPGENSELECT | Real time: 31 mins 42 secs; User CPU time: 30 mins 44 secs | Real time: 3 mins 42 secs; User CPU time: 31 mins 03 secs |
| Decision Tree (Classification Tree) | PROC HPSPLIT | Real time: 21 secs; User CPU time: 19 secs | Real time: 11 secs; User CPU time: 26 secs |
| Random Forest | PROC HPFOREST | Real time: 10 mins 4 secs; User CPU time: 9 mins 55 secs | Real time: 4 mins 31 secs; User CPU time: 12 mins 15 secs |
| Neural Network (MLP, one inner layer, 30 neurons) | PROC HPNEURAL | Real time: 64 mins 23 secs; User CPU time: 64 mins 3 secs | Real time: 10 mins; User CPU time: 65 mins 41 secs |
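For orientation, calls to the tree, forest and neural network procedures named above might look roughly like the sketch below. Everything here is an assumption for illustration: SASHELP.HEART stands in for the presenter's data, the inputs and options are arbitrary, and the statement forms are as I recall them from the SAS documentation, so check them against your release (PROC HPFOREST and PROC HPNEURAL also require the appropriate high-performance data mining licence).

```sas
/* Stand-in data and arbitrary settings; not the presenter's actual code. */

/* Classification tree */
proc hpsplit data=sashelp.heart;
   class Status Sex;
   model Status = Sex AgeAtStart Systolic Cholesterol;
   grow entropy;
   prune costcomplexity;
   performance nthreads=4 details;
run;

/* Random forest */
proc hpforest data=sashelp.heart maxtrees=100;
   target Status / level=binary;
   input AgeAtStart Systolic Cholesterol / level=interval;
   input Sex / level=nominal;
   performance nthreads=4 details;
run;

/* Neural network with one hidden layer of 30 neurons */
proc hpneural data=sashelp.heart;
   input AgeAtStart Systolic Cholesterol;   /* interval inputs */
   input Sex / level=nom;
   target Status / level=nom;
   hidden 30;
   train;
   performance nthreads=4 details;
run;
```

Changing NTHREADS= between runs is one way to reproduce a single-core versus multi-core comparison like the one tabulated above.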

Page 15

COMPARE PREDICTIVE MODELS BUILT BY COMPUTER USING DIFFERENT MACHINE LEARNING ALGORITHMS

| Model | Pseudo R-square |
|---|---|
| Generalized Linear Model | 11.7% |
| Decision Tree | 11.8% |
| Neural Network | 24.7% |

Page 16

ADDITIONAL HINT

Using SAS high-performance statistics, a model can be built quickly and risk factors can be selected automatically within a very short period of time (e.g. building a GLM from scratch takes only a few minutes).
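Automatic selection of risk factors of this kind can be sketched with PROC HPGENSELECT and its SELECTION statement. The example assumes SASHELP.HEART as stand-in data and an arbitrary candidate list; it is not the presenter's model.

```sas
/* Stand-in data and candidate predictors; illustrative only. */
proc hpgenselect data=sashelp.heart;
   class Sex Smoking_Status;
   model Status(event='Dead') = Sex Smoking_Status AgeAtStart Weight Height
                                Systolic Diastolic Cholesterol
                                / distribution=binary link=logit;
   /* Stepwise selection adds and removes candidate effects automatically */
   selection method=stepwise;
   performance nthreads=4 details;
run;
```

The selection summary in the output lists which effects entered or left the model at each step.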

This makes big data analytics and machine learning feasible using our SAS server.

Page 17

THANKS!