skyrocket analytics with sas process automation group...and motorola before he got intoxicated by...
TRANSCRIPT
Brian has master’s degrees in business administration (University of British Columbia) and statistics (Peking University). He was a mechanical engineer at Panasonic and Motorola before he got intoxicated by data magic using SAS in 2005. He is now experienced in DM, CRM, and OR in the telecom, entertainment, and health care industries, and was an executive committee member of the Vancouver SAS user group before relocating to Ontario and joining RIM as a senior data analyst.
Picture Courtesy of Waj Hoda
Next Presentation: “I, Robot”: Skyrocket Analytics with SAS Process Automation
Presenter: Brian Sun
“I, Robot”, Skyrocket Analytics with SAS Process Automation
Brian Sun Jun 8, 2012
[email protected] http://ca.linkedin.com/in/guanghuisun
What’s on the Table
2
• Ideas bigger than building a single SAS macro: Automation system, Business process optimization
• Ideas more than automation: Management, Collaboration
• An example which can be extended: - Small/medium size business/organization - Analysis team as shared service in large corporation
Agenda
3
Statistician/Analyst Manager
Business Process
• Why
• How
• Impact
Analytics
4
Why
Need for Transparency
Amy (Researcher) got an idea!!!
Jennifer (IT) extracted the raw data Brian (Statistician) did some
SECRET work on the data SILENTLY
Publication Amy (researcher) wrote the paper 5
Managerial Concerns
No idea of Brian’s work load
No idea of the process in the black box
When Brian is on vacation, Amy is automatically on vacation
No idea whether Brian did a good job or not
6
Brian’s Nightmares
7
Workload
Data
Code
Requests
……
Being a “hotspot” in the office
Struggle to meet many deadlines
Always dynamic
Never clean
Repeated, and tedious!
Manage rules in the ocean of codes
Revisit your SAS code after 3 months
“what if…”
“Can you analyze the whole database?”
“Can you do statistical tests for everything?”
Consequence
Brian would evolve into a Coding Machine
Brian in 2011 Brian in 2021
Analytics run OUT of CONTROL
Boss would be very upset
8
9
What Doesn’t Kill You Makes You Stronger!
Or……
10
How
Work Flow Rule Task
Glossary
11
Process Automation - Roadmap
12
Process Mapping
12 Translation
Rules Identification
Repeated Tasks
YES
Automatic Decisions
Centralization
Human Decisions
Analytics
SAS Code
OUT IN
IN Ad hoc
NO
Strategy
Standardization
Inside the “Dark Room” of Analytics
13
Process Mapping: Circle of the Analytics
Multivariate Analysis / Statistical Computation
Bivariate Analysis
Univariate Analysis
Data Preparation
Analysis Plan / Study Design
Data Exploration 25%
5%
40%
5%
15%
10%
Data Availability/Missing Check Distinct Value Calculation/Freq Identify Data Quality Issue Preliminary Analysis on Raw Data
Understand Research Questions Literature Review (Clinical/statistical) Data Preparation Plan Statistical Analysis Plan
Data Extract/Transform Cohort Selection Data Integration/ Calculation Validation
Frequency Analysis Distribution Analysis Outlier/Extreme Value
2*2 Frequency Analysis Scatter Plot/Trend Analysis Statistical Tests for Correlation One-Way ANOVA
Hypothesis Tests Multivariate Regression Analysis Longitudinal Data Analysis Ad hoc Model /Computation
14
Process Mapping - Details
15
Research Protocol
Fill the data Request Form
Raw Master Table Build
Data Summary & Univariate Analysis
Data Selection Revisit/
Research Protocol Revise
Update Data Request Form
Raw Master Table Build Revisit
Researcher
Statistician/ Analyst
Analysis Plan/ Study Design
Rules for Data Recode, Transform, Derived
Variable, etc.
Build Final Master Table
Revisit Data Summary &
Univariate Analysis
Bivariate Analysis
Model Identification Plan
Multivariate Analysis/Modeling
Model Calibration /Selection
Update Multivariate
Analysis/Modeling
Model Diagnosis/ Validation
Analysis Report
Modify?
Research Paper Draft
Paper Submission to Journal
Discuss the Analysis Result
Y
N
Collaborative Processes
16
0 Can not be
Standardized /Automated
9 Can be Fully
Standardized /Automated
Standardization of Work Flows and Tasks
Data Extraction
Data Transformation
Standardize the Work Flow - Data Preparation External Database/ Data
• Variable Format Conversion • Variable Rename • Value Recode • Category Regroup
• From Access database • From Excel data • From Ad hoc Format data
• Vertical Merge • Horizontal Merge • Information Consolidation
• Simple New Variable Creation • Scoring • Ad hoc Variable Creation/
Calculation
• Selection by Logic Expression • Selection in Multiple Records • Selection for Distinct Value
Master Table
Data Validation
• Range Check • Missing Value • Consistency
Data Selection
Data Calculation
Data Integration
5
0 Can not be Standardized /Automated 9 Can be Fully Standardized /Automated
Standardize Task and Rule Identification Data
Extraction
•From Excel data
Data Transformation
•Value Recode
0 Can not be Standardized /Automated 9 Can be Fully Standardized /Automated
8
8
Task Attribute (Rule) Value
Task Description Import survey tables (Excel format) to SAS Library
Path \\192.168.10.1\data\survey\2010\Halifax\
Filename Survey_Ver5.2_Halifax_20Dec2010.xls
Sheetname question set A
Output SAS Library survey
Output SAS Table Halifax_2010
Task: Extract data from Excel file
9 Correlation Test
Automated Rules (Decisions) for Correlation Test
N_thread=10 (Control parameter=> set by user)
Format N of distinct values Format N of distinct Values
Numeric any Numeric any Pearson Test
Char <=N_thread Char <=N_thread ChiSQ
Numeric any Char <=N_thread ANOVA Test
Char >N_thread Char >N_thread Exclude from Correlation test
…… …… …… …… ……
Varable 1 Variable 2Correlation test
Task Attribute (Rule) Value
Task Description Impute the survey day of community follow-up as 15 if it is missing
Input table Community_follow_up
Output table CFU_imputed
Target variable Survey_day
Record Condition Survey_day=.
Record Value 15
Task: Recode Variable
Opportunity for Process Automation
Multivariate Analysis / Statistical Computation
Bivariate Analysis
Univariate Analysis
Data Preparation
Analysis Plan / Study Design
Data Exploration
Work Flow Task Overall
0 0 0
5 8
9 9
9 9
3 8
6 8
7
9
9
5
7
0 Can not be Standardized /Automated 9 Can be Fully Standardized /Automated
Research Analytics
“SAS Robot” System
Gear to business Circle
Highly Flexible – independent with dataset/database
Super Time Saving
Smart (Automatic Decision Making)
Control Box (Centralized Rules & Decisions)
Customized Output Multivariate Analysis / Statistical Computation
Bivariate Analysis
Univariate Analysis
Data Preparation
Analysis Plan/ Study Design
Data Exploration
20
Power Master Table Builder
Analytics Report Robot
RegressionRobot
Automatic Database Extractor
21
Example 1 : “Can you analyze the whole database?”
Total N of Tables: 120
Power Analytics Report Robot
25 mins (SAS Server) 2 mins (Brian)
Task ID (Index)
A Complete Analysis of SASHelp Library
2) A 150-page PDF executive summary report is automatically summarized from the full
report, with index created for navigating the full report.
Total N of variables: 2050
Output:
1) A full PDF report of 1227 pages with all univariate plots and tables included.
Time used
Note: Detailed snapshots not available due to data confidential
22
Example 2 : “Can you do the test for everything?”
Total N of variables: 116
Total N of variable pairs for correlation test: 116*115/2=6670
Output:
1) A full PDF report of 9613 pages with all bivariate plots and
statistical test results included.
2) A 325-page PDF executive summary report (main body only
10 pages) is automatically summarized from the full report,
with index created for navigating the full report.
Analytics Report Robot
Time used 5 hours (PC)
2 minutes (Brian)
Analytics Mining
An Exhaustive Pair-wise Correlation Analysis
New Process Mapping
23
Research Protocol
Fill the data Request Form
Raw Master Table Build
Data Summary & Univariate Analysis
Data Selection Revisit/
Research Protocol Revise
Update Data Request Form
Raw Master Table Build
Revisit
Researcher
Statistician/ Analyst
Analysis Plan/ Study Design
Rules for Data Recode, Transform, Derived Variable, etc.
Revisit Data Summary &
Univariate Analysis
Bivariate Analysis
Model Identification Plan
Multivariate Analysis/ Modeling
Model Calibration /Selection
Update Multivariate
Analysis/Modeling
Model Diagnosis/ Validation
Analysis Report
Modify?
Research Paper Draft
Paper Submission to Journal
Discuss the Analysis Result
Y
N
Build Final Master Table
SAS Robot
Collaborative Processes Automated Processes
Response IndependentPearson
Coeff
P
Value
Model
NoModel Title Model Type
Include/
Exclude in
Model
Notes
(Blue: Clinical Researcher;
Green: Statistician)
Adverse_Event_A_Yes age 0.45 0.0091-1
Adverse Event (A)
Predictive Model
Binary Logistic
RegressionInclude
BS: it makes sense that elder people has
higher risk getting adverse events
Adverse_Event_A_Yes BMI 0.26 0.02
1-1Adverse Event (A)
Predictive Model
Binary Logistic
RegressionInclude
BS: I decided to exclude this variable from
analysis because the signiciant collearity
issue we found between BMI_Overweight25
and the high blood pressure.
Adverse_Event_A_Yes smoking_dummy -0.26 0.0041-1
Adverse Event (A)
Predictive Model
Binary Logistic
RegressionInclude
AN: No clinical explanation
Adverse_Event_A_Yes EAD24_ad_Yes 0.30 0.006
1-1Adverse Event (A)
Predictive Model
Binary Logistic
RegressionInclude
AN: I don’t think that we need both trach and
EAD as these are both types of tubes used to
treat a patient
Adverse_Event_A_Yes Gender_Female 0.22 0.0151-1
Adverse Event (A)
Predictive Model
Binary Logistic
RegressionInclude
AN: Literature support that female is more
likely to get adverse event
Adverse_Event_A_Yes Mortality_Yes -0.32 0.0061-1
Adverse Event (A)
Predictive Model
Binary Logistic
RegressionInclude
BS: Does not make logical sense. Mortality
should be an outcome variable (Model 3)
Collaboration Process: Model Identification Plan
24
1
2
3
3 Input
Multivariate Analysis
Bivariate Analysis
Output 1
1-1Adverse Event (A)
Predictive Model
Binary Logistic
RegressionInclude
BS: it makes sense that elder people has
higher risk getting adverse events
1-1Adverse Event (A)
Predictive Model
Binary Logistic
RegressionExclude
BS: I decided to exclude this variable from
analysis because the signiciant collearity
issue we found between BMI_Overweight25
and the high blood pressure.
1-1Adverse Event (A)
Predictive Model
Binary Logistic
RegressionExclude
AN: No clinical explanation
1-1Adverse Event (A)
Predictive Model
Binary Logistic
RegressionExclude
AN: I don’t think that we need both trach and
EAD as these are both types of tubes used to
treat a patient
1-1Adverse Event (A)
Predictive Model
Binary Logistic
RegressionInclude
AN: Literature support that female is more
likely to get adverse event
1-1Adverse Event (A)
Predictive Model
Binary Logistic
RegressionExclude
BS: Does not make logical sense. Mortality
should be an outcome variable (Model 3)
2
New Collaboration Model
25
Rule Input Tables
SAS Stored Processes
SAS office Add-in
Input, manage rules from domain point of view
Run stored process to view/update analysis report
Build, maintain SAS stored processes Input, modify rules
from data and model point of view
Researcher
Statistician /Analyst
SAS EG
26
Impact
Impact on the Analytics Life Circle
Multivariate Analysis / Statistical Computation
Bivariate Analysis
Univariate Analysis
Data Preparation
Analysis Plan / Study Design
Data Exploration 8%
50%
10%
<1%
<1%
30%
25%
5%
40%
5%
15%
10%
Before After
27
TOTA
L TI
ME
TOTA
L TI
ME
Impact on Manager and Researcher
28
Process Visibility Quality Control Project Planning
Take control of business rules
Run/update standard analytics report by themselves
Easy modification of hypotheses/Test “what if” question by themselves
Analytics Mining
Project management improved
Empowered and more engaged in analytics
Researchers
Manager
Impact on Statistician/Analyst
Brian in 2050 29
Brian in 2050
Brian in 2011
Impact on the Business Process
30
Amy (Researcher) got an idea!!!
Jennifer (IT) extracted the raw data
Brian (Statistician) did some SECRET work on
the data SILENTLY
Amy (researcher) wrote the paper
Publication
Rule Input Tables
SAS Stored Process
Input output
SAS office Add-in
Input, manage rules from subject-matter
point of view
Run stored process to
update analysis report
Build, maintain SAS stored procedures
Input, modify rules
from data and model
point of view
Researcher
Statistician/Analyst
SAS EG
Impact on Organization/Business
Process Automation
Productivity
Management
Time
Quality
Money
Publication Global
Reputation
Funding From Government
and Public 31
Conclusions
By building a highly functional and flexible SAS process automation system, we can
Skyrocket the analytics capacity. Re-engineer the business process. Improve management and collaboration. Save significant time and money.
32
This approach can be applied in other businesses! The success can be duplicated!
Question
33