About the Presenter: David J Corliss
• PhD in statistical astrophysics; formerly part-time faculty at Wayne State University
• Analytics Architect in the automotive industry
• Work focuses on bringing university research in big data and time series analysis to the private sector
• Founder of Peace-Work, a volunteer cooperative of statisticians, data scientists and other researchers applying analytics to issues in poverty, education and social justice
Best Practices in Big Data
David J Corliss, PhD
Peace-Work
4/27/2016
IHBI: The Institute for Health and Business Insight
OUTLINE
Data Management
Sampling and Coding for Big Data
Tests For Model Performance
Distributed Computing
Summary
Data Management for Big Data
• Pre-screen records and variables
• Process only the records and variables needed
• Efficient Data Step Coding
• Use less computationally intensive methods
Bad Data Management 101
proc sort data=applicants;
  by demographic_seg ID;
proc genmod data=applicants;
  class demographic_seg;
  model accept = var1-var221 /
    dist = bin
    link = logit
    lrci;
run;
• Unnecessary sort
• Doesn't screen variables first
• Models all variables
• Computationally intensive but not needed
Managing Big Data
/* Screen candidate variables with LASSO on a 0.1% random sample */
proc glmselect data=applicants(where=(ranuni(0) le 0.001));
  model accept = var1-var221 / selection=lasso(stop=none choose=sbc);
run;
/* Fit the model on the full data using only the screened variables */
proc logistic data=applicants;
  class demographic_seg;
  model accept = var12 var57 var125 var203;
run;
• Test on a sample
• Select candidate variables
• Computationally lightest sufficient method
• Model only screened variables
Sampling for Big Data
• Develop analytic processes using sample
• Sample Size
• Representative Samples
• Testing Sample Quality
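As a rough illustration of the sampling points above, the sketch below draws a reproducible simple random sample and then compares one summary statistic between the sample and the full data. The data set name `applicants` and the variable `accept` come from the earlier slides; the output name `dev_sample`, the 0.1% rate, the seed, and the assumption that `accept` is a numeric 0/1 flag are illustrative choices, not taken from the talk.

```sas
/* Draw a 0.1% simple random sample; a fixed seed makes it repeatable.   */
/* dev_sample, the rate and the seed are hypothetical choices.           */
proc surveyselect data=applicants out=dev_sample
     method=srs samprate=0.001 seed=20160427;
run;

/* One simple quality check: the mean of the response (assumed to be a   */
/* numeric 0/1 flag) should be close in the sample and the full data.    */
proc means data=applicants n mean;
  var accept;
run;

proc means data=dev_sample n mean;
  var accept;
run;
```

The same comparison can be repeated for key predictors or segment counts; large gaps suggest the sample is too small or not representative.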
Efficient Coding for Big Data
• Read only the variables needed for analysis
• Pass the data as few times as possible
• Use formats instead of new variables
• Shorten records by using codes instead of text
• Trim unnecessary decimal places
• Computationally light processes where possible
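A minimal sketch of three of these points together (read only the variables needed, pass the data once, use a format instead of a new variable). The data set name `customers` and the `$region` format with its state groupings are hypothetical; `State` and `prod_42` are borrowed from the customer list example in this deck.

```sas
/* A format groups values at report time, so no new variable is stored.  */
/* The region grouping below is an invented example.                     */
proc format;
  value $region 'MI', 'IL', 'MN' = 'Midwest'
                other            = 'Other';
run;

/* KEEP= reads only the two variables used; the data are passed once.    */
proc freq data=customers(keep=State prod_42);
  tables State * prod_42;
  format State $region.;
run;
```

Because the grouping lives in the format, changing the region definition means re-running `proc format`, not rewriting the large data set.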
Coding for Big Data: Hash Object
An Ordinary Customer List

| Name | Street_Address | City | State | Zip_Code | prod_42 | prod_44 |
|---|---|---|---|---|---|---|
| Magnify Analytics | 1 Kennedy Square | Detroit | MI | 48226 | 4 | 3 |
| Fedex Office | 2609 Plymouth Road #7 | Ann Arbor | MI | 48105 | 4 | 2 |
| Hyatt Regency Minneapolis | 1300 Nicollet Mall | Minneapolis | MN | 55403 | 1 | 5 |
| Wrigley Field | 1060 W. Addison St | Chicago | IL | 60613 | 2 | 3 |
| … | | | | | | |
The Same Data in a Hash Table

| Hash_ID | Zip_Code | prod_42 | prod_44 |
|---|---|---|---|
| 00042540 | 48226 | 4 | 3 |
| 00063640 | 48105 | 4 | 2 |
| 00146328 | 55403 | 1 | 5 |
| 00243466 | 60613 | 2 | 3 |
| … | | | |
Coding for Big Data: Hash Object
The Hash Object Process
y = w1 x1 + w2 x2 + w3 x3 + w4 x4 + w5 x5
1. Read the hash key for the given record
2. Look up the value of x1 by the key
3. Multiply by w1 and save it in a buffer
4. Repeat for each component of the model
5. Add all the components to calculate y
6. Release the buffer and go to the next record
7. Repeat for each record
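The steps above can be sketched with the DATA step hash object roughly as follows. This is an illustration under stated assumptions, not the presenter's code: the data sets `customer_attrs` (one row per `Hash_ID` holding `x1`-`x5`) and `to_score` (the records to score) are hypothetical, and the weights are placeholders.

```sas
/* Hypothetical inputs:                                                  */
/*   customer_attrs : one row per Hash_ID with predictors x1-x5          */
/*   to_score       : records to score, identified by Hash_ID            */
data scored(keep=Hash_ID y);
  if _n_ = 1 then do;
    /* Never executes; it only defines Hash_ID and x1-x5 at compile time */
    if 0 then set customer_attrs(keep=Hash_ID x1-x5);
    /* Load the lookup table into memory once                            */
    declare hash attrs(dataset: 'customer_attrs');
    attrs.defineKey('Hash_ID');
    attrs.defineData('x1', 'x2', 'x3', 'x4', 'x5');
    attrs.defineDone();
  end;

  array x{5} x1-x5;
  array w{5} _temporary_ (0.5 0.4 0.3 0.2 0.1);  /* placeholder weights  */

  set to_score(keep=Hash_ID);       /* step 1: read the key              */

  if attrs.find() = 0 then do;      /* step 2: look up x1-x5 by key      */
    y = 0;
    do i = 1 to 5;                  /* steps 3-5: weight and accumulate  */
      y = y + w{i} * x{i};
    end;
  end;
  else y = .;                       /* key not found: leave score missing */
run;
```

The lookup table is read once into memory, so the large `to_score` data set is passed a single time and no sort or merge is needed.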
Testing Model Performance
The Problem of p-values and Big Data

| Explanatory Variable | Estimate | Pr(>\|z\|) |
|---|---|---|
| Var1 | 0.271503909 | < 0.001 |
| Var2 | 0.998361223 | < 0.001 |
| … | … | … |
| Var25 | 0.244677914 | < 0.001 |
| Var26 | 0.387859652 | < 0.001 |
| … | … | … |
| Var100 | 0.561703993 | < 0.001 |
| Var101 | 0.479482516 | 0.002 |
| Var102 | 0.35656757 | 0.003 |
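One way to see why p-values lose their discriminating power on very large data sets is to look at how a Wald statistic scales with the number of records. The relation below is a standard large-sample approximation added for illustration; the constant $c$ (which depends on the design and the residual variance) and the sample size $n$ are symbols introduced here, not taken from the slides.

```latex
z = \frac{\hat{\beta}}{\mathrm{SE}(\hat{\beta})}, \qquad
\mathrm{SE}(\hat{\beta}) \approx \frac{c}{\sqrt{n}}
\quad\Longrightarrow\quad
z \approx \frac{\hat{\beta}}{c}\,\sqrt{n}
```

For a fixed, tiny effect with $\hat{\beta}/c = 0.01$, a sample of $n = 10^4$ gives $z \approx 1$ (not significant), while $n = 10^6$ gives $z \approx 10$, a p-value far below 0.001. With enough records almost every variable passes the threshold, whether or not its effect matters in practice.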
Testing Model Performance
ASA Statement on p-values, 3/7/2016:
“The p-value was never intended to be a substitute for scientific reasoning… Well-reasoned statistical arguments contain much more than the value of a single number and whether that number exceeds an arbitrary threshold. The ASA statement is intended to steer research into a ‘post p<0.05 era.’”
Ron Wasserstein, ASA Executive Director
Testing Model Performance
New Statistical Tests for Big Data
• Bonferroni Correction
• False Discovery Rate (FDR)
• False Coverage Rate (FCR)
• Per-Comparison Error Rate (PCER)
• Bayesian, including Bayesian FCR
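As a small illustration of the first two items, the sketch below feeds a set of raw p-values to `proc multtest` and requests Bonferroni and false-discovery-rate adjustments. The data set name `pvals` is hypothetical and the p-values are illustrative placeholders (two echo the table on the earlier slide, the first is invented); `raw_p` is the variable name the procedure looks for by default in an `inpvalues=` data set.

```sas
/* Illustrative p-values only; the first value is a made-up placeholder. */
data pvals;
  input Test $ raw_p;
  datalines;
Var100 0.0004
Var101 0.002
Var102 0.003
;
run;

/* Adjust the raw p-values for multiple testing:                         */
/*   bonferroni : controls the family-wise error rate                    */
/*   fdr        : controls the false discovery rate                      */
proc multtest inpvalues=pvals bonferroni fdr;
run;
```

With hundreds of candidate variables, the adjusted p-values give a more honest picture of which effects stand out than the raw values do.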
Distributed Computing
Traditional Server Computing
[Diagram labels: SERVER; USER WORK STATIONS]
Need More Resources? >> Get a Bigger Server
[Diagram labels: SERVER; USER WORK STATIONS]
Distributed Computing
Scalable Distributed Computing
[Diagram labels: USER WORK STATIONS; SERVER NODE NETWORK]
Need More Resources? >> Add More Nodes
[Diagram labels: USER WORK STATIONS; SERVER NODE NETWORK]
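From the programmer's side, scaling out can look like the sketch below, which fits the screened model from the earlier slides with a multi-threaded procedure. This assumes a SAS installation that includes the high-performance procedures; the thread count is an arbitrary example, and `event='1'` assumes `accept` is coded 0/1. It is one possible illustration, not a configuration described in the talk.

```sas
/* Multi-threaded logistic regression on the screened variables.         */
/* nthreads=4 is an arbitrary example value.                             */
proc hplogistic data=applicants;
  class demographic_seg;
  model accept(event='1') = demographic_seg var12 var57 var125 var203;
  /* DETAILS prints timing information. On a configured grid, the        */
  /* PERFORMANCE statement can also request distributed execution        */
  /* (for example with NODES=); that setup is site-specific.             */
  performance nthreads=4 details;
run;
```

The modelling code stays essentially the same as resources are added; only the `performance` settings change.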
Summary of Big Data Best Practices
• Use best practices for managing large data sets, with efficient coding
• Pre-screen records and variables, only processing the data needed
• Use sampling where appropriate
• Consider Hash Object Programming to apply scoring models to big data
• Learn and use multi-threaded and distributed statistical procedures
• Use tests for model performance that have been designed for big data
• Look into grid computing for large analytic systems
References and Additional Materials
Programming for Job Security, Arthur Carpenter and Tony Payne
http://www2.sas.com/proceedings/sugi23/Training/p275.pdf
Secrets of Efficient SAS® Coding Techniques
http://support.sas.com/resources/papers/proceedings16/11741-2016.pdf
The SAS Data Step: Where Your Input Matters
http://www.pharmasug.org/proceedings/2012/TF/PharmaSUG-2012-TF04.pdf
Maximizing the Power of Hash Tables, David J Corliss
http://support.sas.com/resources/papers/proceedings13/037-2013.pdf