long live (big data-fied) statistics!matloff/bdvis/jsm.pdf · (big data-fied) statistics! norm...
TRANSCRIPT
![Page 1: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/1.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Long Live (Big Data-Fied) Statistics!
Norm MatloffUniversity of California at Davis
Joint Statistical MeetingsMontreal, August 4, 2013
![Page 2: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/2.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Prolog
Prolog
Where are we with Big Data?
• Role of statistics?
• Role of parallel computation?
• Interactions between the two?
![Page 3: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/3.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Prolog
Prolog
Where are we with Big Data?
• Role of statistics?
• Role of parallel computation?
• Interactions between the two?
![Page 4: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/4.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Where are we?
Attitudes and worries:
• “With Big Data, you don’t need inference methods.”
• “With Machine Learning, you don’t need statistics.”
• Stat community left out of the Big Data revolution (e.g.Amstat News, June 2013).
![Page 5: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/5.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Where are we?
Attitudes and worries:
• “With Big Data, you don’t need inference methods.”
• “With Machine Learning, you don’t need statistics.”
• Stat community left out of the Big Data revolution (e.g.Amstat News, June 2013).
![Page 6: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/6.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Where are we?
Attitudes and worries:
• “With Big Data, you don’t need inference methods.”
• “With Machine Learning, you don’t need statistics.”
• Stat community left out of the Big Data revolution (e.g.Amstat News, June 2013).
![Page 7: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/7.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Where are we?
Attitudes and worries:
• “With Big Data, you don’t need inference methods.”
• “With Machine Learning, you don’t need statistics.”
• Stat community left out of the Big Data revolution (e.g.Amstat News, June 2013).
![Page 8: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/8.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Where are we?
Attitudes and worries:
• “With Big Data, you don’t need inference methods.”
• “With Machine Learning, you don’t need statistics.”
• Stat community left out of the Big Data revolution (e.g.Amstat News, June 2013).
![Page 9: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/9.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Yes, stat is still needed!
Actually, stat is needed more than ever, e.g.:
• Inference is an issue even for big n, once one considerssubsets, where n becomes smaller. Same if do not havep << n.
• Almost all machine learning techniques are revivals of oldstat methods. And if you don’t understand stat, you won’tbe able to use ML methods effectively.
• An old stat technique—nonparametric curveestimation—now more useful than ever, for Big DataGraphics.
• The Curse of Dimensionality hasn’t gone away. Impossibleto understand without stat.
![Page 10: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/10.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Yes, stat is still needed!
Actually, stat is needed more than ever, e.g.:
• Inference is an issue even for big n, once one considerssubsets, where n becomes smaller. Same if do not havep << n.
• Almost all machine learning techniques are revivals of oldstat methods. And if you don’t understand stat, you won’tbe able to use ML methods effectively.
• An old stat technique—nonparametric curveestimation—now more useful than ever, for Big DataGraphics.
• The Curse of Dimensionality hasn’t gone away. Impossibleto understand without stat.
![Page 11: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/11.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Yes, stat is still needed!
Actually, stat is needed more than ever, e.g.:
• Inference is an issue even for big n, once one considerssubsets, where n becomes smaller. Same if do not havep << n.
• Almost all machine learning techniques are revivals of oldstat methods. And if you don’t understand stat, you won’tbe able to use ML methods effectively.
• An old stat technique—nonparametric curveestimation—now more useful than ever, for Big DataGraphics.
• The Curse of Dimensionality hasn’t gone away. Impossibleto understand without stat.
![Page 12: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/12.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Yes, stat is still needed!
Actually, stat is needed more than ever, e.g.:
• Inference is an issue even for big n, once one considerssubsets, where n becomes smaller. Same if do not havep << n.
• Almost all machine learning techniques are revivals of oldstat methods.
And if you don’t understand stat, you won’tbe able to use ML methods effectively.
• An old stat technique—nonparametric curveestimation—now more useful than ever, for Big DataGraphics.
• The Curse of Dimensionality hasn’t gone away. Impossibleto understand without stat.
![Page 13: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/13.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Yes, stat is still needed!
Actually, stat is needed more than ever, e.g.:
• Inference is an issue even for big n, once one considerssubsets, where n becomes smaller. Same if do not havep << n.
• Almost all machine learning techniques are revivals of oldstat methods. And if you don’t understand stat, you won’tbe able to use ML methods effectively.
• An old stat technique—nonparametric curveestimation—now more useful than ever, for Big DataGraphics.
• The Curse of Dimensionality hasn’t gone away. Impossibleto understand without stat.
![Page 14: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/14.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Yes, stat is still needed!
Actually, stat is needed more than ever, e.g.:
• Inference is an issue even for big n, once one considerssubsets, where n becomes smaller. Same if do not havep << n.
• Almost all machine learning techniques are revivals of oldstat methods. And if you don’t understand stat, you won’tbe able to use ML methods effectively.
• An old stat technique—nonparametric curveestimation—now more useful than ever, for Big DataGraphics.
• The Curse of Dimensionality hasn’t gone away. Impossibleto understand without stat.
![Page 15: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/15.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Yes, stat is still needed!
Actually, stat is needed more than ever, e.g.:
• Inference is an issue even for big n, once one considerssubsets, where n becomes smaller. Same if do not havep << n.
• Almost all machine learning techniques are revivals of oldstat methods. And if you don’t understand stat, you won’tbe able to use ML methods effectively.
• An old stat technique—nonparametric curveestimation—now more useful than ever, for Big DataGraphics.
• The Curse of Dimensionality hasn’t gone away.
Impossibleto understand without stat.
![Page 16: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/16.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Yes, stat is still needed!
Actually, stat is needed more than ever, e.g.:
• Inference is an issue even for big n, once one considerssubsets, where n becomes smaller. Same if do not havep << n.
• Almost all machine learning techniques are revivals of oldstat methods. And if you don’t understand stat, you won’tbe able to use ML methods effectively.
• An old stat technique—nonparametric curveestimation—now more useful than ever, for Big DataGraphics.
• The Curse of Dimensionality hasn’t gone away. Impossibleto understand without stat.
![Page 17: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/17.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Setting
Consider the classical (though not universal) data format:
• n observations/cases/instances/...
• p variables/features/attributes/...
• Assumed i.i.d.
(Here I’m trying to include terminology from the nonstatcommunities.)Does “Big” Data mean big n or big p or both?This talk will contain one “big n” section and two “big p”section.
![Page 18: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/18.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Setting
Consider the classical (though not universal) data format:
• n observations/cases/instances/...
• p variables/features/attributes/...
• Assumed i.i.d.
(Here I’m trying to include terminology from the nonstatcommunities.)Does “Big” Data mean big n or big p or both?This talk will contain one “big n” section and two “big p”section.
![Page 19: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/19.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Setting
Consider the classical (though not universal) data format:
• n observations/cases/instances/...
• p variables/features/attributes/...
• Assumed i.i.d.
(Here I’m trying to include terminology from the nonstatcommunities.)Does “Big” Data mean big n or big p or both?This talk will contain one “big n” section and two “big p”section.
![Page 20: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/20.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Setting
Consider the classical (though not universal) data format:
• n observations/cases/instances/...
• p variables/features/attributes/...
• Assumed i.i.d.
(Here I’m trying to include terminology from the nonstatcommunities.)Does “Big” Data mean big n or big p or both?This talk will contain one “big n” section and two “big p”section.
![Page 21: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/21.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Setting
Consider the classical (though not universal) data format:
• n observations/cases/instances/...
• p variables/features/attributes/...
• Assumed i.i.d.
(Here I’m trying to include terminology from the nonstatcommunities.)Does “Big” Data mean big n or big p or both?This talk will contain one “big n” section and two “big p”section.
![Page 22: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/22.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Setting
Consider the classical (though not universal) data format:
• n observations/cases/instances/...
• p variables/features/attributes/...
• Assumed i.i.d.
(Here I’m trying to include terminology from the nonstatcommunities.)
Does “Big” Data mean big n or big p or both?This talk will contain one “big n” section and two “big p”section.
![Page 23: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/23.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Setting
Consider the classical (though not universal) data format:
• n observations/cases/instances/...
• p variables/features/attributes/...
• Assumed i.i.d.
(Here I’m trying to include terminology from the nonstatcommunities.)Does “Big” Data mean big n or big p or both?
This talk will contain one “big n” section and two “big p”section.
![Page 24: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/24.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Setting
Consider the classical (though not universal) data format:
• n observations/cases/instances/...
• p variables/features/attributes/...
• Assumed i.i.d.
(Here I’m trying to include terminology from the nonstatcommunities.)Does “Big” Data mean big n or big p or both?This talk will contain one “big n” section and two “big p”section.
![Page 25: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/25.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big n
Part I: Big-n Problem
![Page 26: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/26.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big n
Part I: Big-n Problem
![Page 27: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/27.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big-n can be handled
Big n can generally be handled:
• Many (though certainly not all) computations “additive,”thus “embarrassingly parallel.”
• Thus amenable to parallel computation, especiallydistributed data, MapReduce etc.
• “Chunks averaging method” (CAM) (Fan et al, 2007;Matloff, 2010; etc.) can turn most statistical computationsinto embarrassingly parallel. (“Software alchemy.”)
![Page 28: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/28.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big-n can be handled
Big n can generally be handled:
• Many (though certainly not all) computations “additive,”thus “embarrassingly parallel.”
• Thus amenable to parallel computation, especiallydistributed data, MapReduce etc.
• “Chunks averaging method” (CAM) (Fan et al, 2007;Matloff, 2010; etc.) can turn most statistical computationsinto embarrassingly parallel. (“Software alchemy.”)
![Page 29: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/29.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big-n can be handled
Big n can generally be handled:
• Many (though certainly not all) computations “additive,”thus “embarrassingly parallel.”
• Thus amenable to parallel computation, especiallydistributed data, MapReduce etc.
• “Chunks averaging method” (CAM) (Fan et al, 2007;Matloff, 2010; etc.) can turn most statistical computationsinto embarrassingly parallel. (“Software alchemy.”)
![Page 30: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/30.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big-n can be handled
Big n can generally be handled:
• Many (though certainly not all) computations “additive,”thus “embarrassingly parallel.”
• Thus amenable to parallel computation, especiallydistributed data, MapReduce etc.
• “Chunks averaging method” (CAM) (Fan et al, 2007;Matloff, 2010; etc.) can turn most statistical computationsinto embarrassingly parallel. (“Software alchemy.”)
![Page 31: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/31.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big-n can be handled
Big n can generally be handled:
• Many (though certainly not all) computations “additive,”thus “embarrassingly parallel.”
• Thus amenable to parallel computation, especiallydistributed data, MapReduce etc.
• “Chunks averaging method” (CAM) (Fan et al, 2007;Matloff, 2010; etc.) can turn most statistical computationsinto embarrassingly parallel.
(“Software alchemy.”)
![Page 32: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/32.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big-n can be handled
Big n can generally be handled:
• Many (though certainly not all) computations “additive,”thus “embarrassingly parallel.”
• Thus amenable to parallel computation, especiallydistributed data, MapReduce etc.
• “Chunks averaging method” (CAM) (Fan et al, 2007;Matloff, 2010; etc.) can turn most statistical computationsinto embarrassingly parallel. (“Software alchemy.”)
![Page 33: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/33.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Chunk Averaging Method
• Key point: CAM converts non-embaarrassingly parallelalgs to additive ones that are statistically equivalent(same standard errors).
• Example: Regression.
• Break observations into chunks.• Fit regression equation to each chunk.• Average the results.
• Produces statistically equivalent results for large n.
• Essentially and i.i.d.-based method, e.g. quantileregression, hazard function estimation, tree methods, etc.
• Superlinear speedup. E.g. quantile regression, 5.31X for 4threads. Can be faster even for just one core.
![Page 34: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/34.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Chunk Averaging Method
• Key point: CAM converts non-embaarrassingly parallelalgs to additive ones that are statistically equivalent(same standard errors).
• Example: Regression.
• Break observations into chunks.• Fit regression equation to each chunk.• Average the results.
• Produces statistically equivalent results for large n.
• Essentially and i.i.d.-based method, e.g. quantileregression, hazard function estimation, tree methods, etc.
• Superlinear speedup. E.g. quantile regression, 5.31X for 4threads. Can be faster even for just one core.
![Page 35: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/35.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Chunk Averaging Method
• Key point: CAM converts non-embaarrassingly parallelalgs to additive ones that are statistically equivalent(same standard errors).
• Example: Regression.
• Break observations into chunks.• Fit regression equation to each chunk.• Average the results.
• Produces statistically equivalent results for large n.
• Essentially and i.i.d.-based method, e.g. quantileregression, hazard function estimation, tree methods, etc.
• Superlinear speedup. E.g. quantile regression, 5.31X for 4threads. Can be faster even for just one core.
![Page 36: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/36.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Chunk Averaging Method
• Key point: CAM converts non-embaarrassingly parallelalgs to additive ones that are statistically equivalent(same standard errors).
• Example: Regression.
• Break observations into chunks.
• Fit regression equation to each chunk.• Average the results.
• Produces statistically equivalent results for large n.
• Essentially and i.i.d.-based method, e.g. quantileregression, hazard function estimation, tree methods, etc.
• Superlinear speedup. E.g. quantile regression, 5.31X for 4threads. Can be faster even for just one core.
![Page 37: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/37.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Chunk Averaging Method
• Key point: CAM converts non-embaarrassingly parallelalgs to additive ones that are statistically equivalent(same standard errors).
• Example: Regression.
• Break observations into chunks.• Fit regression equation to each chunk.
• Average the results.
• Produces statistically equivalent results for large n.
• Essentially and i.i.d.-based method, e.g. quantileregression, hazard function estimation, tree methods, etc.
• Superlinear speedup. E.g. quantile regression, 5.31X for 4threads. Can be faster even for just one core.
![Page 38: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/38.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Chunk Averaging Method
• Key point: CAM converts non-embaarrassingly parallelalgs to additive ones that are statistically equivalent(same standard errors).
• Example: Regression.
• Break observations into chunks.• Fit regression equation to each chunk.• Average the results.
• Produces statistically equivalent results for large n.
• Essentially and i.i.d.-based method, e.g. quantileregression, hazard function estimation, tree methods, etc.
• Superlinear speedup. E.g. quantile regression, 5.31X for 4threads. Can be faster even for just one core.
![Page 39: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/39.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Chunk Averaging Method
• Key point: CAM converts non-embaarrassingly parallelalgs to additive ones that are statistically equivalent(same standard errors).
• Example: Regression.
• Break observations into chunks.• Fit regression equation to each chunk.• Average the results.
• Produces statistically equivalent results for large n.
• Essentially and i.i.d.-based method, e.g. quantileregression, hazard function estimation, tree methods, etc.
• Superlinear speedup. E.g. quantile regression, 5.31X for 4threads. Can be faster even for just one core.
![Page 40: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/40.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Chunk Averaging Method
• Key point: CAM converts non-embaarrassingly parallelalgs to additive ones that are statistically equivalent(same standard errors).
• Example: Regression.
• Break observations into chunks.• Fit regression equation to each chunk.• Average the results.
• Produces statistically equivalent results for large n.
• Essentially and i.i.d.-based method, e.g. quantileregression, hazard function estimation, tree methods, etc.
• Superlinear speedup. E.g. quantile regression, 5.31X for 4threads. Can be faster even for just one core.
![Page 41: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/41.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Chunk Averaging Method
• Key point: CAM converts non-embaarrassingly parallelalgs to additive ones that are statistically equivalent(same standard errors).
• Example: Regression.
• Break observations into chunks.• Fit regression equation to each chunk.• Average the results.
• Produces statistically equivalent results for large n.
• Essentially and i.i.d.-based method, e.g. quantileregression, hazard function estimation, tree methods, etc.
• Superlinear speedup.
E.g. quantile regression, 5.31X for 4threads. Can be faster even for just one core.
![Page 42: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/42.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Chunk Averaging Method
• Key point: CAM converts non-embaarrassingly parallelalgs to additive ones that are statistically equivalent(same standard errors).
• Example: Regression.
• Break observations into chunks.• Fit regression equation to each chunk.• Average the results.
• Produces statistically equivalent results for large n.
• Essentially and i.i.d.-based method, e.g. quantileregression, hazard function estimation, tree methods, etc.
• Superlinear speedup. E.g. quantile regression, 5.31X for 4threads.
Can be faster even for just one core.
![Page 43: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/43.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Chunk Averaging Method
• Key point: CAM converts non-embaarrassingly parallelalgs to additive ones that are statistically equivalent(same standard errors).
• Example: Regression.
• Break observations into chunks.• Fit regression equation to each chunk.• Average the results.
• Produces statistically equivalent results for large n.
• Essentially and i.i.d.-based method, e.g. quantileregression, hazard function estimation, tree methods, etc.
• Superlinear speedup. E.g. quantile regression, 5.31X for 4threads. Can be faster even for just one core.
![Page 44: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/44.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Part II
Part II: A New GraphicalApproach to Big-p
Problem
![Page 45: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/45.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Part II
Part II: A New GraphicalApproach to Big-p
Problem
![Page 46: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/46.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Issues:
• Can deal (a little bit) better with big-p if we can displaylots of variables on the same graph.
• Can’t display all at once, but try to get at least several.
• Problems:
• Displaying > 2 variables on a 2-dimensional device.• “Black screen problem”—with big n, at least parts of the
screen become solid black.
![Page 47: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/47.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Issues:
• Can deal (a little bit) better with big-p if we can displaylots of variables on the same graph.
• Can’t display all at once, but try to get at least several.
• Problems:
• Displaying > 2 variables on a 2-dimensional device.• “Black screen problem”—with big n, at least parts of the
screen become solid black.
![Page 48: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/48.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Issues:
• Can deal (a little bit) better with big-p if we can displaylots of variables on the same graph.
• Can’t display all at once, but try to get at least several.
• Problems:
• Displaying > 2 variables on a 2-dimensional device.• “Black screen problem”—with big n, at least parts of the
screen become solid black.
![Page 49: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/49.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Issues:
• Can deal (a little bit) better with big-p if we can displaylots of variables on the same graph.
• Can’t display all at once, but try to get at least several.
• Problems:
• Displaying > 2 variables on a 2-dimensional device.• “Black screen problem”—with big n, at least parts of the
screen become solid black.
![Page 50: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/50.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Issues:
• Can deal (a little bit) better with big-p if we can displaylots of variables on the same graph.
• Can’t display all at once, but try to get at least several.
• Problems:
• Displaying > 2 variables on a 2-dimensional device.• “Black screen problem”—with big n, at least parts of the
screen become solid black.
![Page 51: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/51.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Issues:
• Can deal (a little bit) better with big-p if we can displaylots of variables on the same graph.
• Can’t display all at once, but try to get at least several.
• Problems:
• Displaying > 2 variables on a 2-dimensional device.
• “Black screen problem”—with big n, at least parts of thescreen become solid black.
![Page 52: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/52.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Issues:
• Can deal (a little bit) better with big-p if we can displaylots of variables on the same graph.
• Can’t display all at once, but try to get at least several.
• Problems:
• Displaying > 2 variables on a 2-dimensional device.• “Black screen problem”—with big n, at least parts of the
screen become solid black.
![Page 53: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/53.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Some existing methods:
• Grand tours: Show sequence of rotations and projections,e.g. tourr in R.Nice, visually appealing. But hard to discern exactrelations, and suffers from black-screen problem.
• Parallel coordinates: Draw one vertical axis for eachvariable. Draw a set of connecting lines for each datapoint.Hard to understand noncontiguous axes, black screenproblem.
![Page 54: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/54.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Some existing methods:
• Grand tours: Show sequence of rotations and projections,e.g. tourr in R.Nice, visually appealing. But hard to discern exactrelations, and suffers from black-screen problem.
• Parallel coordinates: Draw one vertical axis for eachvariable. Draw a set of connecting lines for each datapoint.Hard to understand noncontiguous axes, black screenproblem.
![Page 55: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/55.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Some existing methods:
• Grand tours: Show sequence of rotations and projections,e.g. tourr in R.Nice, visually appealing. But hard to discern exactrelations, and suffers from black-screen problem.
• Parallel coordinates: Draw one vertical axis for eachvariable. Draw a set of connecting lines for each datapoint.Hard to understand noncontiguous axes, black screenproblem.
![Page 56: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/56.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Some existing methods:
• Grand tours:
Show sequence of rotations and projections,e.g. tourr in R.Nice, visually appealing. But hard to discern exactrelations, and suffers from black-screen problem.
• Parallel coordinates: Draw one vertical axis for eachvariable. Draw a set of connecting lines for each datapoint.Hard to understand noncontiguous axes, black screenproblem.
![Page 57: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/57.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Some existing methods:
• Grand tours: Show sequence of rotations and projections,e.g. tourr in R.
Nice, visually appealing. But hard to discern exactrelations, and suffers from black-screen problem.
• Parallel coordinates: Draw one vertical axis for eachvariable. Draw a set of connecting lines for each datapoint.Hard to understand noncontiguous axes, black screenproblem.
![Page 58: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/58.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Some existing methods:
• Grand tours: Show sequence of rotations and projections,e.g. tourr in R.Nice, visually appealing.
But hard to discern exactrelations, and suffers from black-screen problem.
• Parallel coordinates: Draw one vertical axis for eachvariable. Draw a set of connecting lines for each datapoint.Hard to understand noncontiguous axes, black screenproblem.
![Page 59: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/59.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Some existing methods:
• Grand tours: Show sequence of rotations and projections,e.g. tourr in R.Nice, visually appealing. But hard to discern exactrelations, and suffers from black-screen problem.
• Parallel coordinates: Draw one vertical axis for eachvariable. Draw a set of connecting lines for each datapoint.Hard to understand noncontiguous axes, black screenproblem.
![Page 60: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/60.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Some existing methods:
• Grand tours: Show sequence of rotations and projections,e.g. tourr in R.Nice, visually appealing. But hard to discern exactrelations, and suffers from black-screen problem.
• Parallel coordinates:
Draw one vertical axis for eachvariable. Draw a set of connecting lines for each datapoint.Hard to understand noncontiguous axes, black screenproblem.
![Page 61: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/61.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Some existing methods:
• Grand tours: Show sequence of rotations and projections,e.g. tourr in R.Nice, visually appealing. But hard to discern exactrelations, and suffers from black-screen problem.
• Parallel coordinates: Draw one vertical axis for eachvariable. Draw a set of connecting lines for each datapoint.
Hard to understand noncontiguous axes, black screenproblem.
![Page 62: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/62.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Graphing Lots of Variables
Some existing methods:
• Grand tours: Show sequence of rotations and projections,e.g. tourr in R.Nice, visually appealing. But hard to discern exactrelations, and suffers from black-screen problem.
• Parallel coordinates: Draw one vertical axis for eachvariable. Draw a set of connecting lines for each datapoint.Hard to understand noncontiguous axes, black screenproblem.
![Page 63: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/63.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Statistics to the rescue!
An obvious solution to the black-screen problem:nonparametric curve estimation.Example: Scatter plots.
• Ordinary plot would fill the screen.
• Solution: Draw the nonpar. 2-dim. density estimateinstead.
• That makes 3 dimensions, but code third dimension(density height) via color.
• E.g. scatterSmooth() in R.
![Page 64: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/64.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Statistics to the rescue!
An obvious solution to the black-screen problem:
nonparametric curve estimation.Example: Scatter plots.
• Ordinary plot would fill the screen.
• Solution: Draw the nonpar. 2-dim. density estimateinstead.
• That makes 3 dimensions, but code third dimension(density height) via color.
• E.g. scatterSmooth() in R.
![Page 65: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/65.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Statistics to the rescue!
An obvious solution to the black-screen problem:nonparametric curve estimation.
Example: Scatter plots.
• Ordinary plot would fill the screen.
• Solution: Draw the nonpar. 2-dim. density estimateinstead.
• That makes 3 dimensions, but code third dimension(density height) via color.
• E.g. scatterSmooth() in R.
![Page 66: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/66.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Statistics to the rescue!
An obvious solution to the black-screen problem:nonparametric curve estimation.Example: Scatter plots.
• Ordinary plot would fill the screen.
• Solution: Draw the nonpar. 2-dim. density estimateinstead.
• That makes 3 dimensions, but code third dimension(density height) via color.
• E.g. scatterSmooth() in R.
![Page 67: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/67.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Statistics to the rescue!
An obvious solution to the black-screen problem:nonparametric curve estimation.Example: Scatter plots.
• Ordinary plot would fill the screen.
• Solution: Draw the nonpar. 2-dim. density estimateinstead.
• That makes 3 dimensions, but code third dimension(density height) via color.
• E.g. scatterSmooth() in R.
![Page 68: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/68.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Statistics to the rescue!
An obvious solution to the black-screen problem:nonparametric curve estimation.Example: Scatter plots.
• Ordinary plot would fill the screen.
• Solution: Draw the nonpar. 2-dim. density estimateinstead.
• That makes 3 dimensions, but code third dimension(density height) via color.
• E.g. scatterSmooth() in R.
![Page 69: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/69.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Statistics to the rescue!
An obvious solution to the black-screen problem:nonparametric curve estimation.Example: Scatter plots.
• Ordinary plot would fill the screen.
• Solution: Draw the nonpar. 2-dim. density estimateinstead.
• That makes 3 dimensions, but code third dimension(density height) via color.
• E.g. scatterSmooth() in R.
![Page 70: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/70.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Statistics to the rescue!
An obvious solution to the black-screen problem:nonparametric curve estimation.Example: Scatter plots.
• Ordinary plot would fill the screen.
• Solution: Draw the nonpar. 2-dim. density estimateinstead.
• That makes 3 dimensions, but code third dimension(density height) via color.
• E.g. scatterSmooth() in R.
![Page 71: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/71.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Displaying 3 Vars. in 2 Dims.
The scatterSmooth() example actually shows how to display3 variables in 2 dimensions:Say have variables X, Y, Z.
• Plot regression function of Z, color coded, against X andY.
• Regression function: m(s,t) = E(Z | X = s, Y = t) (i.e.general, not assuming param. model).
• Use nonpar. estimation, e.g. nearest-neighbor.
• 3 vars. in 2 dims.! (No perspective plotting.)
![Page 72: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/72.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Displaying 3 Vars. in 2 Dims.
The scatterSmooth() example actually shows how to display3 variables in 2 dimensions:
Say have variables X, Y, Z.
• Plot regression function of Z, color coded, against X andY.
• Regression function: m(s,t) = E(Z | X = s, Y = t) (i.e.general, not assuming param. model).
• Use nonpar. estimation, e.g. nearest-neighbor.
• 3 vars. in 2 dims.! (No perspective plotting.)
![Page 73: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/73.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Displaying 3 Vars. in 2 Dims.
The scatterSmooth() example actually shows how to display3 variables in 2 dimensions:Say have variables X, Y, Z.
• Plot regression function of Z, color coded, against X andY.
• Regression function: m(s,t) = E(Z | X = s, Y = t) (i.e.general, not assuming param. model).
• Use nonpar. estimation, e.g. nearest-neighbor.
• 3 vars. in 2 dims.! (No perspective plotting.)
![Page 74: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/74.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Displaying 3 Vars. in 2 Dims.
The scatterSmooth() example actually shows how to display3 variables in 2 dimensions:Say have variables X, Y, Z.
• Plot regression function of Z, color coded, against X andY.
• Regression function: m(s,t) = E(Z | X = s, Y = t) (i.e.general, not assuming param. model).
• Use nonpar. estimation, e.g. nearest-neighbor.
• 3 vars. in 2 dims.! (No perspective plotting.)
![Page 75: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/75.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Displaying 3 Vars. in 2 Dims.
The scatterSmooth() example actually shows how to display3 variables in 2 dimensions:Say have variables X, Y, Z.
• Plot regression function of Z, color coded, against X andY.
• Regression function: m(s,t) = E(Z | X = s, Y = t) (i.e.general, not assuming param. model).
• Use nonpar. estimation, e.g. nearest-neighbor.
• 3 vars. in 2 dims.! (No perspective plotting.)
![Page 76: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/76.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Displaying 3 Vars. in 2 Dims.
The scatterSmooth() example actually shows how to display3 variables in 2 dimensions:Say have variables X, Y, Z.
• Plot regression function of Z, color coded, against X andY.
• Regression function: m(s,t) = E(Z | X = s, Y = t) (i.e.general, not assuming param. model).
• Use nonpar. estimation, e.g. nearest-neighbor.
• 3 vars. in 2 dims.! (No perspective plotting.)
![Page 77: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/77.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Displaying 3 Vars. in 2 Dims.
The scatterSmooth() example actually shows how to display3 variables in 2 dimensions:Say have variables X, Y, Z.
• Plot regression function of Z, color coded, against X andY.
• Regression function: m(s,t) = E(Z | X = s, Y = t) (i.e.general, not assuming param. model).
• Use nonpar. estimation, e.g. nearest-neighbor.
• 3 vars. in 2 dims.!
(No perspective plotting.)
![Page 78: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/78.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Displaying 3 Vars. in 2 Dims.
The scatterSmooth() example actually shows how to display3 variables in 2 dimensions:Say have variables X, Y, Z.
• Plot regression function of Z, color coded, against X andY.
• Regression function: m(s,t) = E(Z | X = s, Y = t) (i.e.general, not assuming param. model).
• Use nonpar. estimation, e.g. nearest-neighbor.
• 3 vars. in 2 dims.! (No perspective plotting.)
![Page 79: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/79.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Proposed Boundary Method
I introduce here a new approach to plotting multiple variablesin 2 dims., based on “boundary curves.”First, again consider 3 variables, X, Y and Z.
• For user-chosen b, boundary is the set
{(s, t) : E (Z |X = s,Y = t) = b} (1)
• User might set b = E(Z) (overall, unconditional mean).
• Plot estimate of the boundary curve.
• Displaying 3 vars. in 2 dims.
![Page 80: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/80.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Proposed Boundary Method
I introduce here a new approach to plotting multiple variablesin 2 dims.,
based on “boundary curves.”First, again consider 3 variables, X, Y and Z.
• For user-chosen b, boundary is the set
{(s, t) : E (Z |X = s,Y = t) = b} (1)
• User might set b = E(Z) (overall, unconditional mean).
• Plot estimate of the boundary curve.
• Displaying 3 vars. in 2 dims.
![Page 81: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/81.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Proposed Boundary Method
I introduce here a new approach to plotting multiple variablesin 2 dims., based on “boundary curves.”
First, again consider 3 variables, X, Y and Z.
• For user-chosen b, boundary is the set
{(s, t) : E (Z |X = s,Y = t) = b} (1)
• User might set b = E(Z) (overall, unconditional mean).
• Plot estimate of the boundary curve.
• Displaying 3 vars. in 2 dims.
![Page 82: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/82.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Proposed Boundary Method
I introduce here a new approach to plotting multiple variablesin 2 dims., based on “boundary curves.”First, again consider 3 variables, X, Y and Z.
• For user-chosen b, boundary is the set
{(s, t) : E (Z |X = s,Y = t) = b} (1)
• User might set b = E(Z) (overall, unconditional mean).
• Plot estimate of the boundary curve.
• Displaying 3 vars. in 2 dims.
![Page 83: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/83.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Proposed Boundary Method
I introduce here a new approach to plotting multiple variablesin 2 dims., based on “boundary curves.”First, again consider 3 variables, X, Y and Z.
• For user-chosen b, boundary is the set
{(s, t) : E (Z |X = s,Y = t) = b} (1)
• User might set b = E(Z) (overall, unconditional mean).
• Plot estimate of the boundary curve.
• Displaying 3 vars. in 2 dims.
![Page 84: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/84.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Proposed Boundary Method
I introduce here a new approach to plotting multiple variablesin 2 dims., based on “boundary curves.”First, again consider 3 variables, X, Y and Z.
• For user-chosen b, boundary is the set
{(s, t) : E (Z |X = s,Y = t) = b} (1)
• User might set b = E(Z) (overall, unconditional mean).
• Plot estimate of the boundary curve.
• Displaying 3 vars. in 2 dims.
![Page 85: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/85.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Proposed Boundary Method
I introduce here a new approach to plotting multiple variablesin 2 dims., based on “boundary curves.”First, again consider 3 variables, X, Y and Z.
• For user-chosen b, boundary is the set
{(s, t) : E (Z |X = s,Y = t) = b} (1)
• User might set b = E(Z) (overall, unconditional mean).
• Plot estimate of the boundary curve.
• Displaying 3 vars. in 2 dims.
![Page 86: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/86.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Proposed Boundary Method
I introduce here a new approach to plotting multiple variablesin 2 dims., based on “boundary curves.”First, again consider 3 variables, X, Y and Z.
• For user-chosen b, boundary is the set
{(s, t) : E (Z |X = s,Y = t) = b} (1)
• User might set b = E(Z) (overall, unconditional mean).
• Plot estimate of the boundary curve.
• Displaying 3 vars. in 2 dims.
![Page 87: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/87.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Boundary Plot Example
• Bank account data, UCI repository.
• X = age of customer, Y = current bank account, Z = sayYes to open new type of account
• b = EZ = P(Z = 1)
![Page 88: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/88.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Boundary Plot Example
• Bank account data, UCI repository.
• X = age of customer, Y = current bank account, Z = sayYes to open new type of account
• b = EZ = P(Z = 1)
![Page 89: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/89.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Boundary Plot Example
• Bank account data, UCI repository.
• X = age of customer, Y = current bank account, Z = sayYes to open new type of account
• b = EZ = P(Z = 1)
![Page 90: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/90.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Boundary Plot Example
• Bank account data, UCI repository.
• X = age of customer, Y = current bank account, Z = sayYes to open new type of account
• b = EZ = P(Z = 1)
![Page 91: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/91.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Bank Example
• Above linemeans,above-avg.prob. sign upfor newaccount.
• Near retire ⇒“hardest sell”!
• Those around60 need alarge balancebefore willingto try newaccount.
![Page 92: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/92.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Bank Example
• Above linemeans,above-avg.prob. sign upfor newaccount.
• Near retire ⇒“hardest sell”!
• Those around60 need alarge balancebefore willingto try newaccount.
![Page 93: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/93.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Bank Example
• Above linemeans,above-avg.prob. sign upfor newaccount.
• Near retire ⇒“hardest sell”!
• Those around60 need alarge balancebefore willingto try newaccount.
![Page 94: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/94.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Bank Example
• Above linemeans,above-avg.prob. sign upfor newaccount.
• Near retire ⇒“hardest sell”!
• Those around60 need alarge balancebefore willingto try newaccount.
![Page 95: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/95.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Bank Example
• Above linemeans,above-avg.prob. sign upfor newaccount.
• Near retire ⇒“hardest sell”!
• Those around60 need alarge balancebefore willingto try newaccount.
![Page 96: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/96.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
More Than 3 Vars. in 2 Dims.
• Plotting boundaries has been done before.
• But the idea here is to display several boundaries at once,so as to display more variables in one 2-dim. graph.
![Page 97: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/97.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
More Than 3 Vars. in 2 Dims.
• Plotting boundaries has been done before.
• But the idea here is to display several boundaries at once,so as to display more variables in one 2-dim. graph.
![Page 98: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/98.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
More Than 3 Vars. in 2 Dims.
• Plotting boundaries has been done before.
• But the idea here is to display several boundaries at once,so as to display more variables in one 2-dim. graph.
![Page 99: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/99.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Example: Adult Data
• UCI Adult data
• X = age, Y = education, Z = high income
• But now add a 4th variable: W = gender
• Plot 2 boundary curves, one male and one female.
• Thus display 4 variables in 2 dims.
![Page 100: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/100.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Example: Adult Data
• UCI Adult data
• X = age, Y = education, Z = high income
• But now add a 4th variable: W = gender
• Plot 2 boundary curves, one male and one female.
• Thus display 4 variables in 2 dims.
![Page 101: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/101.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Example: Adult Data
• UCI Adult data
• X = age, Y = education, Z = high income
• But now add a 4th variable: W = gender
• Plot 2 boundary curves, one male and one female.
• Thus display 4 variables in 2 dims.
![Page 102: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/102.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Example: Adult Data
• UCI Adult data
• X = age, Y = education, Z = high income
• But now add a 4th variable: W = gender
• Plot 2 boundary curves, one male and one female.
• Thus display 4 variables in 2 dims.
![Page 103: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/103.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Example: Adult Data
• UCI Adult data
• X = age, Y = education, Z = high income
• But now add a 4th variable: W = gender
• Plot 2 boundary curves, one male and one female.
• Thus display 4 variables in 2 dims.
![Page 104: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/104.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Example: Adult Data
• UCI Adult data
• X = age, Y = education, Z = high income
• But now add a 4th variable: W = gender
• Plot 2 boundary curves, one male and one female.
• Thus display 4 variables in 2 dims.
![Page 105: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/105.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Adult Example
• Above linemeans,higher-than-avg. prob. ofhigh income.
• Before age35, not muchdifference.
• After age 35,women needmuch moreeducationthan men tolikely havehigh income.
![Page 106: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/106.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Adult Example
• Above linemeans,higher-than-avg. prob. ofhigh income.
• Before age35, not muchdifference.
• After age 35,women needmuch moreeducationthan men tolikely havehigh income.
![Page 107: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/107.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Adult Example
• Above linemeans,higher-than-avg. prob. ofhigh income.
• Before age35, not muchdifference.
• After age 35,women needmuch moreeducationthan men tolikely havehigh income.
![Page 108: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/108.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Adult Example
• Above linemeans,higher-than-avg. prob. ofhigh income.
• Before age35, not muchdifference.
• After age 35,women needmuch moreeducationthan men tolikely havehigh income.
![Page 109: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/109.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Adult Example
• Above linemeans,higher-than-avg. prob. ofhigh income.
• Before age35, not muchdifference.
• After age 35,women needmuch moreeducationthan men tolikely havehigh income.
![Page 110: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/110.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Example: Flight Lateness
• Airline lateness data.
• X = departure delay, Y = distance, Z = arrival lateness,W = originating airport (here, SFO, IAD, IAH), so again,displaying 4 variables in 2 dims.
• 3 curves, one for each airport
• Could add V = daytime vs. evening, for 6 curves, thusdisplaying 5 variables in 2 dims.
• Could plot straight regressions too, but boundaries alwaysenable us to plot “one more variable.”
![Page 111: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/111.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Example: Flight Lateness
• Airline lateness data.
• X = departure delay, Y = distance, Z = arrival lateness,W = originating airport (here, SFO, IAD, IAH), so again,displaying 4 variables in 2 dims.
• 3 curves, one for each airport
• Could add V = daytime vs. evening, for 6 curves, thusdisplaying 5 variables in 2 dims.
• Could plot straight regressions too, but boundaries alwaysenable us to plot “one more variable.”
![Page 112: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/112.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Example: Flight Lateness
• Airline lateness data.
• X = departure delay, Y = distance, Z = arrival lateness,W = originating airport (here, SFO, IAD, IAH),
so again,displaying 4 variables in 2 dims.
• 3 curves, one for each airport
• Could add V = daytime vs. evening, for 6 curves, thusdisplaying 5 variables in 2 dims.
• Could plot straight regressions too, but boundaries alwaysenable us to plot “one more variable.”
![Page 113: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/113.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Example: Flight Lateness
• Airline lateness data.
• X = departure delay, Y = distance, Z = arrival lateness,W = originating airport (here, SFO, IAD, IAH), so again,displaying 4 variables in 2 dims.
• 3 curves, one for each airport
• Could add V = daytime vs. evening, for 6 curves, thusdisplaying 5 variables in 2 dims.
• Could plot straight regressions too, but boundaries alwaysenable us to plot “one more variable.”
![Page 114: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/114.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Example: Flight Lateness
• Airline lateness data.
• X = departure delay, Y = distance, Z = arrival lateness,W = originating airport (here, SFO, IAD, IAH), so again,displaying 4 variables in 2 dims.
• 3 curves, one for each airport
• Could add V = daytime vs. evening, for 6 curves, thusdisplaying 5 variables in 2 dims.
• Could plot straight regressions too, but boundaries alwaysenable us to plot “one more variable.”
![Page 115: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/115.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Example: Flight Lateness
• Airline lateness data.
• X = departure delay, Y = distance, Z = arrival lateness,W = originating airport (here, SFO, IAD, IAH), so again,displaying 4 variables in 2 dims.
• 3 curves, one for each airport
• Could add V = daytime vs. evening, for 6 curves, thusdisplaying 5 variables in 2 dims.
• Could plot straight regressions too, but boundaries alwaysenable us to plot “one more variable.”
![Page 116: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/116.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Airline Example
• Above linemeans,higher-than-avg. meandelay.
• SFO seems tobe doingbetter. Needa very longflight to haveabove-avg.delay, relativeto the others.
![Page 117: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/117.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Airline Example
• Above linemeans,higher-than-avg. meandelay.
• SFO seems tobe doingbetter. Needa very longflight to haveabove-avg.delay, relativeto the others.
![Page 118: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/118.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Airline Example
• Above linemeans,higher-than-avg. meandelay.
• SFO seems tobe doingbetter. Needa very longflight to haveabove-avg.delay, relativeto the others.
![Page 119: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/119.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Airline Example
• Above linemeans,higher-than-avg. meandelay.
• SFO seems tobe doingbetter.
Needa very longflight to haveabove-avg.delay, relativeto the others.
![Page 120: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/120.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Airline Example
• Above linemeans,higher-than-avg. meandelay.
• SFO seems tobe doingbetter. Needa very longflight to haveabove-avg.delay, relativeto the others.
![Page 121: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/121.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Computation
• R’s (contour() not used (don’t want “islands”).
• Estimate regression (via fast kNN, FNN library).
• Find “boundary band,” all points near the estimateboundary.
• Smooth the band.
![Page 122: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/122.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Computation
• R’s (contour() not used (don’t want “islands”).
• Estimate regression (via fast kNN, FNN library).
• Find “boundary band,” all points near the estimateboundary.
• Smooth the band.
![Page 123: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/123.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Computation
• R’s (contour() not used (don’t want “islands”).
• Estimate regression (via fast kNN, FNN library).
• Find “boundary band,” all points near the estimateboundary.
• Smooth the band.
![Page 124: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/124.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Computation
• R’s (contour() not used (don’t want “islands”).
• Estimate regression (via fast kNN, FNN library).
• Find “boundary band,” all points near the estimateboundary.
• Smooth the band.
![Page 125: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/125.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Computation
• R’s (contour() not used (don’t want “islands”).
• Estimate regression (via fast kNN, FNN library).
• Find “boundary band,” all points near the estimateboundary.
• Smooth the band.
![Page 126: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/126.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Parallel Computation
Computation can be voluminous.
• Parallel processing.
• Take advantage of superlinearity from CAM.
• Break into chunks, but only find near nghbrs. withinchunks, not across chunks.
• The“A” part of CAM comes in the smoothing of the band.
![Page 127: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/127.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Parallel Computation
Computation can be voluminous.
• Parallel processing.
• Take advantage of superlinearity from CAM.
• Break into chunks, but only find near nghbrs. withinchunks, not across chunks.
• The“A” part of CAM comes in the smoothing of the band.
![Page 128: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/128.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Parallel Computation
Computation can be voluminous.
• Parallel processing.
• Take advantage of superlinearity from CAM.
• Break into chunks, but only find near nghbrs. withinchunks, not across chunks.
• The“A” part of CAM comes in the smoothing of the band.
![Page 129: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/129.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Parallel Computation
Computation can be voluminous.
• Parallel processing.
• Take advantage of superlinearity from CAM.
• Break into chunks, but only find near nghbrs. withinchunks, not across chunks.
• The“A” part of CAM comes in the smoothing of the band.
![Page 130: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/130.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Parallel Computation
Computation can be voluminous.
• Parallel processing.
• Take advantage of superlinearity from CAM.
• Break into chunks, but only find near nghbrs. withinchunks, not across chunks.
• The“A” part of CAM comes in the smoothing of the band.
![Page 131: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/131.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Parallel Computation
Computation can be voluminous.
• Parallel processing.
• Take advantage of superlinearity from CAM.
• Break into chunks, but only find near nghbrs. withinchunks, not across chunks.
• The“A” part of CAM comes in the smoothing of the band.
![Page 132: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/132.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Part III
Part III: Big p and theCurse of DimensionalityExorcizing the Curse of Dimensionality
Some small steps in that direction.
![Page 133: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/133.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Part III
Part III: Big p and theCurse of DimensionalityExorcizing the Curse of DimensionalitySome small steps in that direction.
![Page 134: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/134.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big p
• Theoretical considerations imply that should have p <√n
in regression case (Portnoy, 1968).
• Yet today p >> n is commonplace.
• This causes “multiple inference” problems (e.g. familywiseerror rates).
• So, e.g., CI radii 1.96 std.err.(θ̂) might NOT be“essentially 0.” I.e., Big n not big after all.
• And the ever-present Curse of Dimensionality.
![Page 135: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/135.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big p
• Theoretical considerations imply that should have p <√n
in regression case (Portnoy, 1968).
• Yet today p >> n is commonplace.
• This causes “multiple inference” problems (e.g. familywiseerror rates).
• So, e.g., CI radii 1.96 std.err.(θ̂) might NOT be“essentially 0.” I.e., Big n not big after all.
• And the ever-present Curse of Dimensionality.
![Page 136: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/136.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big p
• Theoretical considerations imply that should have p <√n
in regression case (Portnoy, 1968).
• Yet today p >> n is commonplace.
• This causes “multiple inference” problems (e.g. familywiseerror rates).
• So, e.g., CI radii 1.96 std.err.(θ̂) might NOT be“essentially 0.” I.e., Big n not big after all.
• And the ever-present Curse of Dimensionality.
![Page 137: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/137.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big p
• Theoretical considerations imply that should have p <√n
in regression case (Portnoy, 1968).
• Yet today p >> n is commonplace.
• This causes “multiple inference” problems (e.g. familywiseerror rates).
• So, e.g., CI radii 1.96 std.err.(θ̂) might NOT be“essentially 0.” I.e., Big n not big after all.
• And the ever-present Curse of Dimensionality.
![Page 138: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/138.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big p
• Theoretical considerations imply that should have p <√n
in regression case (Portnoy, 1968).
• Yet today p >> n is commonplace.
• This causes “multiple inference” problems (e.g. familywiseerror rates).
• So, e.g., CI radii 1.96 std.err.(θ̂) might NOT be“essentially 0.”
I.e., Big n not big after all.
• And the ever-present Curse of Dimensionality.
![Page 139: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/139.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big p
• Theoretical considerations imply that should have p <√n
in regression case (Portnoy, 1968).
• Yet today p >> n is commonplace.
• This causes “multiple inference” problems (e.g. familywiseerror rates).
• So, e.g., CI radii 1.96 std.err.(θ̂) might NOT be“essentially 0.” I.e., Big n not big after all.
• And the ever-present Curse of Dimensionality.
![Page 140: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/140.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Big p
• Theoretical considerations imply that should have p <√n
in regression case (Portnoy, 1968).
• Yet today p >> n is commonplace.
• This causes “multiple inference” problems (e.g. familywiseerror rates).
• So, e.g., CI radii 1.96 std.err.(θ̂) might NOT be“essentially 0.” I.e., Big n not big after all.
• And the ever-present Curse of Dimensionality.
![Page 141: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/141.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Principle Components Analysis
• What sizes of p relative to n might be problematic forPCA?
• Sample covariance matrix V has p(p-1)/2 distinct entries.• Data matrix has np entries.• So V is completely determined (except roundoff error) if np
= p(p-1)/2.• So, have problem if p > 2n, roughly.
![Page 142: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/142.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Principle Components Analysis
• What sizes of p relative to n might be problematic forPCA?
• Sample covariance matrix V has p(p-1)/2 distinct entries.• Data matrix has np entries.• So V is completely determined (except roundoff error) if np
= p(p-1)/2.• So, have problem if p > 2n, roughly.
![Page 143: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/143.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Principle Components Analysis
• What sizes of p relative to n might be problematic forPCA?
• Sample covariance matrix V has p(p-1)/2 distinct entries.
• Data matrix has np entries.• So V is completely determined (except roundoff error) if np
= p(p-1)/2.• So, have problem if p > 2n, roughly.
![Page 144: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/144.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Principle Components Analysis
• What sizes of p relative to n might be problematic forPCA?
• Sample covariance matrix V has p(p-1)/2 distinct entries.• Data matrix has np entries.
• So V is completely determined (except roundoff error) if np= p(p-1)/2.
• So, have problem if p > 2n, roughly.
![Page 145: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/145.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Principle Components Analysis
• What sizes of p relative to n might be problematic forPCA?
• Sample covariance matrix V has p(p-1)/2 distinct entries.• Data matrix has np entries.• So V is completely determined (except roundoff error) if np
= p(p-1)/2.
• So, have problem if p > 2n, roughly.
![Page 146: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/146.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Principle Components Analysis
• What sizes of p relative to n might be problematic forPCA?
• Sample covariance matrix V has p(p-1)/2 distinct entries.• Data matrix has np entries.• So V is completely determined (except roundoff error) if np
= p(p-1)/2.• So, have problem if p > 2n, roughly.
![Page 147: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/147.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
PCA Experiment
Simulation experiment:
• Y1, Y2 indep. N(0,1); X1 = Y1 + Y2, X2 = Y1 − Y2,X3, ...,Xp iid N(0,1), indep. of X1, X2.
• First PC should be (1,0,0,...) or (0,1,0,...).
> s imfunct ion ( n , p ) {
y1 <− rnorm ( n ) ; y2 <− rnorm ( n ) ;x1 <− y1+y2 ; x2 <− y1−y2 ; p2 <− p − 2x <−
cbind ( x1 , x2 , matrix ( rnorm ( n∗p2 ) , ncol=p2 ) )c v r <− cov ( x )which .max(
abs ( eigen ( cvr , symmetr ic=T)$ v e c t o r s [ , 1 ] ) )}
![Page 148: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/148.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
PCA Experiment
Simulation experiment:
• Y1, Y2 indep. N(0,1); X1 = Y1 + Y2, X2 = Y1 − Y2,X3, ...,Xp iid N(0,1), indep. of X1, X2.
• First PC should be (1,0,0,...) or (0,1,0,...).
> s imfunct ion ( n , p ) {
y1 <− rnorm ( n ) ; y2 <− rnorm ( n ) ;x1 <− y1+y2 ; x2 <− y1−y2 ; p2 <− p − 2x <−
cbind ( x1 , x2 , matrix ( rnorm ( n∗p2 ) , ncol=p2 ) )c v r <− cov ( x )which .max(
abs ( eigen ( cvr , symmetr ic=T)$ v e c t o r s [ , 1 ] ) )}
![Page 149: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/149.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
PCA Experiment
Simulation experiment:
• Y1, Y2 indep. N(0,1); X1 = Y1 + Y2, X2 = Y1 − Y2,X3, ...,Xp iid N(0,1), indep. of X1, X2.
• First PC should be (1,0,0,...) or (0,1,0,...).
> s imfunct ion ( n , p ) {
y1 <− rnorm ( n ) ; y2 <− rnorm ( n ) ;x1 <− y1+y2 ; x2 <− y1−y2 ; p2 <− p − 2x <−
cbind ( x1 , x2 , matrix ( rnorm ( n∗p2 ) , ncol=p2 ) )c v r <− cov ( x )which .max(
abs ( eigen ( cvr , symmetr ic=T)$ v e c t o r s [ , 1 ] ) )}
![Page 150: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/150.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
PCA Experiment
Simulation experiment:
• Y1, Y2 indep. N(0,1); X1 = Y1 + Y2, X2 = Y1 − Y2,X3, ...,Xp iid N(0,1), indep. of X1, X2.
• First PC should be (1,0,0,...) or (0,1,0,...).
> s imfunct ion ( n , p ) {
y1 <− rnorm ( n ) ; y2 <− rnorm ( n ) ;x1 <− y1+y2 ; x2 <− y1−y2 ; p2 <− p − 2x <−
cbind ( x1 , x2 , matrix ( rnorm ( n∗p2 ) , ncol=p2 ) )c v r <− cov ( x )which .max(
abs ( eigen ( cvr , symmetr ic=T)$ v e c t o r s [ , 1 ] ) )}
![Page 151: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/151.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
PCA Experiment
Simulation experiment:
• Y1, Y2 indep. N(0,1); X1 = Y1 + Y2, X2 = Y1 − Y2,X3, ...,Xp iid N(0,1), indep. of X1, X2.
• First PC should be (1,0,0,...) or (0,1,0,...).
> s imfunct ion ( n , p ) {
y1 <− rnorm ( n ) ; y2 <− rnorm ( n ) ;x1 <− y1+y2 ; x2 <− y1−y2 ; p2 <− p − 2x <−
cbind ( x1 , x2 , matrix ( rnorm ( n∗p2 ) , ncol=p2 ) )c v r <− cov ( x )which .max(
abs ( eigen ( cvr , symmetr ic=T)$ v e c t o r s [ , 1 ] ) )}
![Page 152: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/152.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Simulation, cont’d.
Return value from sim() should be 1 or 2. Let’s see:
> s im ( 5 0 0 , 4 0 0 )[ 1 ] 1> s im ( 5 0 0 , 8 0 0 )[ 1 ] 1> s im ( 5 0 0 , 8 0 0 )[ 1 ] 2> s im ( 5 0 0 , 1 2 0 0 )[ 1 ] 439> s im ( 5 0 0 , 1 2 0 0 )[ 1 ] 2> s im ( 5 0 0 , 1 2 0 0 )[ 1 ] 1> s im ( 5 0 0 , 1 2 0 0 )[ 1 ] 905
![Page 153: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/153.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Simulation, cont’d.
Return value from sim() should be 1 or 2. Let’s see:
> s im ( 5 0 0 , 4 0 0 )[ 1 ] 1> s im ( 5 0 0 , 8 0 0 )[ 1 ] 1> s im ( 5 0 0 , 8 0 0 )[ 1 ] 2> s im ( 5 0 0 , 1 2 0 0 )[ 1 ] 439> s im ( 5 0 0 , 1 2 0 0 )[ 1 ] 2> s im ( 5 0 0 , 1 2 0 0 )[ 1 ] 1> s im ( 5 0 0 , 1 2 0 0 )[ 1 ] 905
![Page 154: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/154.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Simulation, cont’d.
• When n < p/2—very common in practice!—sometimesright but sometimes get phantom PCs.
• On the other hand, results of Johnstone (2000) suggestthat as long as n > p/2 we might be OK.
• Moreover, in practice the variables are correlated, oftenvery highly so, in regular patterns. I suspect this makes it“more OK.”
![Page 155: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/155.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Simulation, cont’d.
• When n < p/2—very common in practice!—sometimesright but sometimes get phantom PCs.
• On the other hand, results of Johnstone (2000) suggestthat as long as n > p/2 we might be OK.
• Moreover, in practice the variables are correlated, oftenvery highly so, in regular patterns. I suspect this makes it“more OK.”
![Page 156: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/156.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Simulation, cont’d.
• When n < p/2—very common in practice!—sometimesright but sometimes get phantom PCs.
• On the other hand, results of Johnstone (2000) suggestthat as long as n > p/2 we might be OK.
• Moreover, in practice the variables are correlated, oftenvery highly so, in regular patterns. I suspect this makes it“more OK.”
![Page 157: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/157.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Simulation, cont’d.
• When n < p/2—very common in practice!—sometimesright but sometimes get phantom PCs.
• On the other hand, results of Johnstone (2000) suggestthat as long as n > p/2 we might be OK.
• Moreover, in practice the variables are correlated, oftenvery highly so, in regular patterns. I suspect this makes it“more OK.”
![Page 158: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/158.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Exorcizing the Curse?
• The term curse of dimensionality goes back 50 years.
• In last 10-15 years, it has gotten scarier: Berry(1999)proved that any 2 points are approximately the samedistance from each other!
• So, e.g., nearest-neighbor methods look iffy.
• My own rough derivation:
• Suppose the p distance components are iid.•
√Var(distance)/E (distance)→ 0 as p− >∞
• So, distances are approximately constant.
![Page 159: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/159.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Exorcizing the Curse?
• The term curse of dimensionality goes back 50 years.
• In last 10-15 years, it has gotten scarier: Berry(1999)proved that any 2 points are approximately the samedistance from each other!
• So, e.g., nearest-neighbor methods look iffy.
• My own rough derivation:
• Suppose the p distance components are iid.•
√Var(distance)/E (distance)→ 0 as p− >∞
• So, distances are approximately constant.
![Page 160: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/160.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Exorcizing the Curse?
• The term curse of dimensionality goes back 50 years.
• In last 10-15 years, it has gotten scarier:
Berry(1999)proved that any 2 points are approximately the samedistance from each other!
• So, e.g., nearest-neighbor methods look iffy.
• My own rough derivation:
• Suppose the p distance components are iid.•
√Var(distance)/E (distance)→ 0 as p− >∞
• So, distances are approximately constant.
![Page 161: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/161.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Exorcizing the Curse?
• The term curse of dimensionality goes back 50 years.
• In last 10-15 years, it has gotten scarier: Berry(1999)proved that any 2 points are approximately the samedistance from each other!
• So, e.g., nearest-neighbor methods look iffy.
• My own rough derivation:
• Suppose the p distance components are iid.•
√Var(distance)/E (distance)→ 0 as p− >∞
• So, distances are approximately constant.
![Page 162: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/162.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Exorcizing the Curse?
• The term curse of dimensionality goes back 50 years.
• In last 10-15 years, it has gotten scarier: Berry(1999)proved that any 2 points are approximately the samedistance from each other!
• So, e.g., nearest-neighbor methods look iffy.
• My own rough derivation:
• Suppose the p distance components are iid.•
√Var(distance)/E (distance)→ 0 as p− >∞
• So, distances are approximately constant.
![Page 163: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/163.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Exorcizing the Curse?
• The term curse of dimensionality goes back 50 years.
• In last 10-15 years, it has gotten scarier: Berry(1999)proved that any 2 points are approximately the samedistance from each other!
• So, e.g., nearest-neighbor methods look iffy.
• My own rough derivation:
• Suppose the p distance components are iid.•
√Var(distance)/E (distance)→ 0 as p− >∞
• So, distances are approximately constant.
![Page 164: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/164.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Exorcizing the Curse?
• The term curse of dimensionality goes back 50 years.
• In last 10-15 years, it has gotten scarier: Berry(1999)proved that any 2 points are approximately the samedistance from each other!
• So, e.g., nearest-neighbor methods look iffy.
• My own rough derivation:
• Suppose the p distance components are iid.
•√Var(distance)/E (distance)→ 0 as p− >∞
• So, distances are approximately constant.
![Page 165: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/165.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Exorcizing the Curse?
• The term curse of dimensionality goes back 50 years.
• In last 10-15 years, it has gotten scarier: Berry(1999)proved that any 2 points are approximately the samedistance from each other!
• So, e.g., nearest-neighbor methods look iffy.
• My own rough derivation:
• Suppose the p distance components are iid.•
√Var(distance)/E (distance)→ 0 as p− >∞
• So, distances are approximately constant.
![Page 166: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/166.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Exorcizing the Curse?
• The term curse of dimensionality goes back 50 years.
• In last 10-15 years, it has gotten scarier: Berry(1999)proved that any 2 points are approximately the samedistance from each other!
• So, e.g., nearest-neighbor methods look iffy.
• My own rough derivation:
• Suppose the p distance components are iid.•
√Var(distance)/E (distance)→ 0 as p− >∞
• So, distances are approximately constant.
![Page 167: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/167.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Some Hope
Some Hope:
• But all that involves equally-weighted components indistance.
• Yet, arguably we should have weights, according toimportance of the variables.
• Then the above problem goes away. (Coef. of var. doesnot go to 0.)
• But how set the weights?
• Stay tuned...
![Page 168: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/168.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Some Hope
Some Hope:
• But all that involves equally-weighted components indistance.
• Yet, arguably we should have weights, according toimportance of the variables.
• Then the above problem goes away. (Coef. of var. doesnot go to 0.)
• But how set the weights?
• Stay tuned...
![Page 169: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/169.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Some Hope
Some Hope:
• But all that involves equally-weighted components indistance.
• Yet, arguably we should have weights, according toimportance of the variables.
• Then the above problem goes away. (Coef. of var. doesnot go to 0.)
• But how set the weights?
• Stay tuned...
![Page 170: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/170.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Some Hope
Some Hope:
• But all that involves equally-weighted components indistance.
• Yet, arguably we should have weights, according toimportance of the variables.
• Then the above problem goes away. (Coef. of var. doesnot go to 0.)
• But how set the weights?
• Stay tuned...
![Page 171: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/171.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Some Hope
Some Hope:
• But all that involves equally-weighted components indistance.
• Yet, arguably we should have weights, according toimportance of the variables.
• Then the above problem goes away. (Coef. of var. doesnot go to 0.)
• But how set the weights?
• Stay tuned...
![Page 172: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/172.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Some Hope
Some Hope:
• But all that involves equally-weighted components indistance.
• Yet, arguably we should have weights, according toimportance of the variables.
• Then the above problem goes away. (Coef. of var. doesnot go to 0.)
• But how set the weights?
• Stay tuned...
![Page 173: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/173.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Misc.
Online materiasl:The visualization code is available for your use andcomments/suggestions:http://heather.cs.ucdavis.edu/BigDataVis.html
These slides are there too.
Acknowlegements:The author would like to thank Noah Gift, Marnie Dunsmore,Nicholas Lewin-Koh, and David Scott for helpful discussions,and Hao Chen and Bill Hsu for use of their high-performancecomputing equipment.
![Page 174: Long Live (Big Data-Fied) Statistics!matloff/BDVis/JSM.pdf · (Big Data-Fied) Statistics! Norm Matlo University of California at Davis Where are we? Attitudes and worries: \With Big](https://reader034.vdocuments.us/reader034/viewer/2022051605/60094306be1917236b7ade54/html5/thumbnails/174.jpg)
Long Live(Big
Data-Fied)Statistics!
Norm MatloffUniversity ofCalifornia at
Davis
Misc.
Online materiasl:The visualization code is available for your use andcomments/suggestions:http://heather.cs.ucdavis.edu/BigDataVis.html
These slides are there too.
Acknowlegements:The author would like to thank Noah Gift, Marnie Dunsmore,Nicholas Lewin-Koh, and David Scott for helpful discussions,and Hao Chen and Bill Hsu for use of their high-performancecomputing equipment.