USING SAS HIGH PERFORMANCE STATISTICS FOR PREDICTIVE MODELLING
Regan LU, CFA, FRM
SAS certified Statistical Business Analyst & SAS certified Advanced Programmer
Future of Work Taskforce
Department of Jobs and Small Business
Part 1: Using Predictive Modelling in the Public Sector
Part 2: Building Predictive Models using SAS
Part 3: Advantages of SAS High Performance Statistics
Part 4: Examples and Comparison
PART 1: PREDICTIVE MODELLING IN THE PUBLIC SECTOR
Use longitudinal administrative data to build predictive models.
Predictive models can be applied to resolve policy problems.
For instance, we could use a predictive model to estimate the labour force participation of a target group in the 2016-17 financial year.
PREDICTIVE MODELS COULD BE USED FOR EVIDENCE-BASED POLICY DEVELOPMENT
1. Generalised Linear Model (GLM)
Binomial
Multinomial
Gamma Distribution
2. Decision Trees
Classification Tree
Regression Tree
BUILD PREDICTIVE MODELS USING DIFFERENT MACHINE LEARNING ALGORITHMS FOR EVIDENCE-BASED POLICY DEVELOPMENT
Supervised Learning
• Generalised Linear Model (Forward, Backward, Stepwise)
• Decision Trees
• Random Forest
• Neural Network
• Support Vector Machine, etc.
Unsupervised Learning
• K-means Clustering
• Cosine Similarity
MODEL VALIDATION 1: RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE
Using longitudinal data, a model was built to predict the labour force participation rate of the target group next year.
The closer the curve follows the left-hand border and the top border of the ROC space, the more accurate the predictive model is.
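As a minimal sketch of how such a curve could be produced in SAS, the steps below simulate a small data set and ask PROC LOGISTIC for its ROC plot. The data set WORK.TOY, the variables PARTICIPATES, AGE and MONTHS_EMPLOYED, and the coefficients are hypothetical; they are not the model described in this talk.

```sas
/* Hypothetical toy data: names and coefficients are illustrative only. */
data work.toy;
  call streaminit(2018);
  do id = 1 to 5000;
    age = 20 + floor(45 * rand('uniform'));
    months_employed = floor(24 * rand('uniform'));
    eta = -2 + 0.03 * age + 0.08 * months_employed;
    /* Binary outcome drawn with a logistic probability */
    participates = rand('bernoulli', 1 / (1 + exp(-eta)));
    output;
  end;
  drop eta;
run;

ods graphics on;

/* PLOTS(ONLY)=ROC draws the ROC curve; OUTROC= saves the points of the curve. */
proc logistic data=work.toy plots(only)=roc;
  model participates(event='1') = age months_employed / outroc=work.roc_points;
run;
```

The area under the curve is reported as the "c" statistic in the association table of the PROC LOGISTIC output, which gives a single number to compare between candidate models.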
MODEL VALIDATION 2: PARTIAL DEPENDENCY PLOT
Using longitudinal data, a model was built to predict the labour force participation rate of the target group next year.
The following chart indicates the accuracy of the prediction across different target groups.
[Chart: target groups A, B, C and D]
PART 2: BUILDING PREDICTIVE MODELS USING SAS
Regression
Y = β1X1 + β2X2 + … + βLXL + ε
SAS Statistical Package:
• PROC LOGISTIC
• PROC GENMOD
High Performance SAS Statistics
• PROC HPLOGISTIC
• PROC HPGENSELECT
SYNTAX: LOGISTIC PROCEDURE
PROC LOGISTIC DATA=DATASET <OPTIONS>;
MODEL RESPONSE=PREDICTOR /<OPTIONS>;
OUTPUT OUT=SAS-DATASET;
RUN;
SAS High-performance Statistics has similar syntax
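To make the LOGISTIC template concrete, here is a small sketch that fills in the placeholders. The data set WORK.TOY and every variable name in it are assumptions made up for illustration, and the data are simulated rather than taken from the administrative data discussed in the talk.

```sas
/* Hypothetical toy data: names and coefficients are illustrative only. */
data work.toy;
  call streaminit(2018);
  do id = 1 to 5000;
    age = 20 + floor(45 * rand('uniform'));
    months_employed = floor(24 * rand('uniform'));
    eta = -2 + 0.03 * age + 0.08 * months_employed;
    participates = rand('bernoulli', 1 / (1 + exp(-eta)));
    output;
  end;
  drop eta;
run;

/* RESPONSE = participates, PREDICTORS = age and months_employed.   */
/* EVENT='1' models the probability that participates equals 1.     */
proc logistic data=work.toy;
  model participates(event='1') = age months_employed;
  /* P= names the variable that holds the predicted probability */
  output out=work.scored p=p_hat;
run;
```

WORK.SCORED then holds the input rows plus P_HAT, the predicted probability for each observation.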
PART 3: ADVANTAGES OF SAS HIGH PERFORMANCE STATISTICS
SAS high-performance statistics procedures take advantage of parallel processing, which is crucial for big data analytics.
PROC HPLOGISTIC DATA=DATASET <OPTIONS>;
PERFORMANCE CPUCOUNT=24 NTHREADS=24;
MODEL RESPONSE=PREDICTOR /<OPTIONS>;
OUTPUT OUT=SAS-DATASET;
RUN;
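The template above could be filled in along the following lines. This is a sketch only: WORK.TOY and its variables are hypothetical, the data are simulated, and NTHREADS=4 is an arbitrary choice that should be matched to the cores actually available on the server.

```sas
/* Hypothetical toy data: names and coefficients are illustrative only. */
data work.toy;
  call streaminit(2018);
  do id = 1 to 5000;
    age = 20 + floor(45 * rand('uniform'));
    months_employed = floor(24 * rand('uniform'));
    eta = -2 + 0.03 * age + 0.08 * months_employed;
    participates = rand('bernoulli', 1 / (1 + exp(-eta)));
    output;
  end;
  drop eta;
run;

proc hplogistic data=work.toy;
  model participates(event='1') = age months_employed;
  /* NTHREADS= sets the number of threads; DETAILS prints a timing table */
  performance nthreads=4 details;
  /* ID copies the key variable into the output data set */
  id id;
  /* PRED= names the variable that holds the predicted probability */
  output out=work.scored_hp pred=p_hat;
run;
```

Re-running the step with different NTHREADS= values and comparing the real time in the log is one simple way to reproduce the kind of comparison shown in Part 4.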
ADVANTAGES OF SAS HIGH PERFORMANCE STATISTICS
When building predictive models with machine learning algorithms such as GLM, we found that high-performance statistics significantly improved the efficiency of model building.
However, high-performance statistics do not always outperform the traditional SAS package (e.g. HPSUMMARY for descriptive statistics).
PART 4: EXAMPLES AND COMPARISONS
Build a model using GLM to predict the labour force attachment of a target group.
SAS data: 638k observations with 121 variables
COMPARISON OF TIME SPENT ON RUNNING ONE REGRESSION

During the lunch break
                      PROC LOGISTIC        PROC HPLOGISTIC   PROC HPLOGISTIC
                      (Non-HP statistics)  (HP statistics)   (HP statistics)
Number of Threads     (default: 1)         4                 24
Real Time (mm:ss)     28:50.45             1:33.74           1:32.78
CPU Time (mm:ss)      28:50.68             16:33.03          16:20.17

Between 2 p.m. and 3 p.m.
                      PROC LOGISTIC        PROC HPLOGISTIC   PROC HPLOGISTIC
                      (Non-HP statistics)  (HP statistics)   (HP statistics)
Number of Threads     (default: 1)         4                 24
Real Time (mm:ss)     28:15.13             4:24.48           1:42.51
CPU Time (mm:ss)      28:14.54             16:22.67          16:44.18
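To put a number on the gain, the real (elapsed) times from the two tables above can be converted to seconds and divided. The sketch below does that arithmetic in a DATA step; the data set name WORK.SPEEDUP and its variable names are made up for illustration, and only the timings come from the slides.

```sas
/* Real times copied from the tables above.                         */
/* base = PROC LOGISTIC, hp = PROC HPLOGISTIC (minutes and seconds). */
data work.speedup;
  length period $ 8 hp_threads 8;
  input period $ base_min base_sec hp_threads hp_min hp_sec;
  base_total = 60 * base_min + base_sec;   /* PROC LOGISTIC, seconds   */
  hp_total   = 60 * hp_min + hp_sec;       /* PROC HPLOGISTIC, seconds */
  speedup    = base_total / hp_total;      /* ratio of elapsed times   */
  format speedup 6.2;
  datalines;
lunch 28 50.45 4 1 33.74
lunch 28 50.45 24 1 32.78
2-3pm 28 15.13 4 4 24.48
2-3pm 28 15.13 24 1 42.51
;
run;

proc print data=work.speedup noobs;
  var period hp_threads base_total hp_total speedup;
run;
```

The ratio comes out at about 18 for both lunch-break runs, and at about 6 (4 threads) and 17 (24 threads) for the afternoon runs. CPU time is not reduced in the same way, because the same work is spread across threads rather than removed.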
USING SAS HIGH-PERFORMANCE PACKAGE TO BUILD A PREDICTIVE MODEL FROM SCRATCH USING DIFFERENT MACHINE LEARNING ALGORITHMS

Both columns use the High Performance Statistical Package.

Machine Learning Algorithm                         Procedure         1 Core                            24 Cores
                                                                     Real time        User CPU time    Real time        User CPU time
Generalized Linear Model (Stepwise)                PROC HPGENSELECT  31 mins 42 secs  30 mins 44 secs  3 mins 42 secs   31 mins 03 secs
Decision Tree (Classification Tree)                PROC HPSPLIT      21 secs          19 secs          11 secs          26 secs
Random Forest                                      PROC HPFOREST     10 mins 4 secs   9 mins 55 secs   4 mins 31 secs   12 mins 15 secs
Neural Network (MLP, one inner layer, 30 neurons)  PROC HPNEURAL     64 mins 23 secs  64 mins 3 secs   10 mins          65 mins 41 secs
COMPARE PREDICTIVE MODELS BUILT BY COMPUTER USING DIFFERENT MACHINE LEARNING ALGORITHMS
Generalized Linear Model    Decision Tree            Neural Network
Pseudo-Rsquare=11.7%        Pseudo-Rsquare=11.8%     Pseudo-Rsquare=24.7%
ADDITIONAL HINT
Using SAS high performance statistics, a model can be built quickly and its risk factors selected automatically (e.g. building a GLM from scratch takes only a few minutes).
This makes big data analytics and machine learning feasible using our SAS server.
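As a rough sketch of what automatic selection of risk factors can look like, the step below runs a stepwise binary GLM with PROC HPGENSELECT on simulated data. WORK.TOY and all variable names are hypothetical; NOISE1 and NOISE2 are deliberately unrelated to the outcome so that the selection has something to reject, and NTHREADS=4 is an arbitrary choice.

```sas
/* Hypothetical toy data: names and coefficients are illustrative only. */
data work.toy;
  call streaminit(2018);
  do id = 1 to 5000;
    age = 20 + floor(45 * rand('uniform'));
    months_employed = floor(24 * rand('uniform'));
    noise1 = rand('normal');   /* unrelated to the outcome by construction */
    noise2 = rand('normal');   /* unrelated to the outcome by construction */
    eta = -2 + 0.03 * age + 0.08 * months_employed;
    participates = rand('bernoulli', 1 / (1 + exp(-eta)));
    output;
  end;
  drop eta;
run;

proc hpgenselect data=work.toy;
  /* Binary response with a logit link, i.e. a logistic GLM */
  model participates(event='1') = age months_employed noise1 noise2
        / dist=binary link=logit;
  /* Let the procedure add and remove candidate effects step by step */
  selection method=stepwise;
  performance nthreads=4 details;
run;
```

The selection summary in the output lists which effects entered or left the model at each step. The selected effects still deserve a review against policy knowledge before the model is used.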
THANKS!