-
Data Mining
Robert A Stine
stine @ wharton.upenn.edu
Dept of Statistics, Wharton School
University of Pennsylvania
St Gallen
June 2016
-
Introduction
Syllabus, materials, topics
-
Wharton Department of Statistics
Welcome and Introductions
Background
Undergraduate political science and math; interest in voting behavior and election polls
PhD in Statistics from Princeton: bootstrap resampling in time series analysis
Interest in data mining: research in variable selection for time series and large models with 1000s of predictors
Recent and on-going projects
Streaming variable selection
Spatial and sparse time series
Models for text
-
What is data mining?
An insult when I learned statistics: making up results to suit your theory!
Now: discovering predictive patterns in data
Data: typically large, loosely structured data sets
Seldom much better than hand-tuned regression, but DM is often faster and provides a useful diagnostic
Science: testable claims
Tasks: prediction and classification; estimation and interpretation
-
Caution!
We find patterns in surprising places with fascinating consequences!
The $28,000 grilled cheese!
-
Data Mining in Social Science
Poorly suited to social science?
Empiricism run wild; lack of theory or hypotheses
Post hoc inference
Response
Leverage technology: Tukey observed that the cost of theory remains high compared to computing
Honest: a better match to what most do in common practice
Diagnostic: have I missed something?
Deep connections: multidimensional scaling, likelihood, modern regression
-
Course Objectives
For you to leave confident that you can
Recognize when data mining can help
Apply new techniques to your own data
Explain the intuition behind the methods
Expand your knowledge of statistical methods
Big picture
Wide variety of tools for building models
Save time/energy for finding data and asking questions
More likely to happen if you ask questions here!
So don't hesitate...
-
Textbooks
Machine learning
Modern term for statistical techniques designed for large amounts of complex data
Originated in computer science
An Introduction to Statistical Learning (ISL, 2013), James, Witten, Hastie, Tibshirani
Examples in R for each chapter
Basis for much of the afternoon sessions
The Elements of Statistical Learning (ESL, 2009), Hastie, Tibshirani, and Friedman
More details, additional topics not covered in ISL
-
Plan for Lectures
Loosely follow the James text
Models and averaging (Ch 1-4): regression, bootstrap, classification
Feature selection (Ch 4-6): cross-validation, shrinkage, Lasso
Nonlinearity (Ch 7, HO, ESL Ch 11): GAM, neural networks, boosting
Tree-based methods (Ch 8): CART, random forests, bagging
Practical modeling (Ch 9-10, readings): case studies, text, unsupervised features
Morning lecture
Afternoon R and JMP
-
Requirements
Taking the course for a grade?
Doing the exercises is a useful follow-up even if you are not
Daily exercises
From the James textbook
Start on them during the afternoon lab sessions
Several for each class
Submit a package within 2 weeks
Analysis in R and/or JMP
Explanation as called for in the exercise
Appropriate use of graphics
-
Additional References and Software
-
Research Papers: Fitting regressions, big models, inference
Chatfield (1995), JRSS: Model Uncertainty, Data Mining and Statistical Inference
Sala-i-Martin (1997), AER: I Just Ran Two Million Regressions
Hand et al. (2000), Statistical Science: Data Mining for Fun and Profit
Foster and Stine (2004), JASA: Variable Selection in Data Mining
Ward, Greenhill, and Bakke (2010), J of Peace Research: The Perils of Policy by p-Value: Predicting Civil Conflicts
Stine (1989), SMR: An Introduction to Bootstrap Methods
https://dl.dropboxusercontent.com/u/5425615/stgallenDM/sgdm_papers.tgz
-
Research Papers: Neural networks
Zeng (1999), SMR: Prediction and Classification with Neural Networks
Beck, King, and Zeng (2000), APSR: Improving Quantitative Studies of International Conflict
De Marchi, Gelpi, and Grynaviski (2004), APSR: Untangling Neural Nets
Trees
Berk (2006), SMR: An Introduction to Ensemble Methods for Data Analysis
Weerts and Ronca (2010), Education Economics: Using Classification Trees to Predict Alumni Giving
Kastellec (2010), J of Empirical Legal Studies: The Statistical Analysis of Judicial Decisions with Trees
-
Research Papers: Text analysis
Hopkins and King (2007): Extracting Social Science Meaning from Text
Hopkins and King (2010), AJPS: A Method of Automated Nonparametric Content Analysis for Social Science
Monroe, Colaresi, and Quinn (2008), Political Analysis: Fightin' Words: Identifying Political Conflict
Turney and Pantel (2010): From Frequency to Meaning: Vector Space Models of Semantics
Grimmer and Stewart (2013): Text as Data: Promise and Pitfalls (arXiv)
-
Software
Considerations
Interface: type commands at a prompt (R) vs point-and-click (JMP)
Data input and output: ease of getting your data in and out
Extensibility: ability to add your own customization
Graphics: presentation (R) vs interactive (JMP)
Scope: cutting (bleeding) edge (R) vs ease of use (JMP)
Cost: free (R) vs commercial (JMP from SAS, free trial)
Support, user community: particularly within your own company or among collaborators
Class
Blend of R (text) and JMP (interactive)
Alternatives from IBM, Stata...
You don't mine data by hand!
-
Books
Principles of Data Mining (2001), Hand, Mannila, and Smyth
Personal favorite, but dated
More like a textbook, with a blend of theory and practice, but limited examples
Large number of background citations
Data Mining (3rd ed., 2011), Witten, Frank, and Hall
Emphasis on trees, association rules, and domain knowledge
Terse but comprehensive coverage, with terminology
Bound to the WEKA software; small examples
Data Mining Techniques (3rd ed., 2011), Linoff and Berry
Business decision-making emphasis (100+ pages of introduction)
Wide scope, with emphasis on business communication and visualization
"Data mining cannot be bought, it must be mastered"; discusses privacy concerns
Data Mining with R: Case Studies (2011), Torgo
Introduction to R with five case studies
Smallish data: the algae-blooms study has 7 cases to predict
Others cover a stock trading scheme and fraud detection; special coverage of microarrays
Includes SVM and MARS, along with basics
-
More Books
Practical Applications of Data Mining (2012), Suh
Eclectic topics; emphasis on association rules, fuzzy sets, neural networks; less statistics
Covers data retrieval, SQL, database management issues
Numerous detailed algorithm examples for small, in-text data tables
Data Mining (2nd ed., 2011), Kantardzic
Broad coverage, from business aspects to ensemble learning to text mining
Usefully annotated bibliography, encyclopedic extent
Precious little in the way of real examples
Background and specific methods
Modern Multivariate Statistical Techniques (2008), Izenman
Principal Component Analysis (2002), Jolliffe
Independent Component Analysis (2004), Stone
Neural Networks for Pattern Recognition (1996), Bishop
Classification and Regression Trees (1984), Breiman, Friedman, Stone, Olshen
-
Data
What would be the ideal data for answering my questions?
-
Data Collection Issues
Collecting, organizing, cleaning
90% of the work on any real task
A weak aspect of statistics education and training
Audit trail; value in scripting; reproducibility
Common problems
Data are not random samples
Secondary sources: data gathered for another task (e.g., accounting), repurposed for modeling
Merging data from multiple sources
Inconsistent labels and conventions
Changes over time in coverage and definitions
Mismatching cases from different sources
Missing data
Accuracy
Data entry errors
Mislabeled variables
Categories coded as numbers
-
Data Table
Traditional rectangular layout
Rows are n cases
Columns are variables
Getting taller and wider these days
Often more columns than rows (Arcene)
When tall, much taller! (Income)
Most software uses the tabular convention
Some customize, separating data into multiple files
Longitudinal data
Transactional data: customer histories, patient records
Dependence exaggerates the amount of data
[schematic: rows case1, case2, ..., casen by columns var1, var2, ..., vark]
-
Tall Data Table
Classical big data, becoming less common...
Income data for targeted marketing, from UCI
Goal: identify high-income households based on location and descriptive characteristics
32,500 rows x 15 columns
US Census data on households
Banking data from UCI
Goal: identify customers who will respond to a solicitation for a term deposit program
45,000 rows x 18 columns
Portuguese bank study
Sparse response of 10% (fraud is much less)
UCI = University of California, Irvine ML repository, http://archive.ics.uci.edu
-
Square Data Table
Election survey
Expanding domain of questions, balanced against the cost of more cases
American National Election Study (ANES) 2008
Goal: explain why voters pick the candidates that they choose
2,000 registered voters (rows), with about the same number of features (columns)
All sorts of variables: age, party affiliation, open-ended responses
Complications
Missing data
Interviewer effects
Sampling weights, ...
-
Wide Data Table
Automation
Automated data collection allows much more extensive measurement
Text data is very wide
Arcene example from UCI
200 cases, positive/negative for cancer
10,000 features: mass spectrometer measurements
Challenge: separate normal cells from cancerous cells (prostate, ovarian)
Complications galore
Easy to fit a model that separates the observed cases perfectly because n < p
Hard to test out-of-sample because there are few cases
UCI = University of California, Irvine ML repository, http://archive.ics.uci.edu
p = number of possible explanatory variables
arcene
-
Some of Both
Financial data from Imperial
Defaults of customers
Sample of training data: 25,000 cases, 770 features
Sparse: only 3,244 non-zero losses
Anonymous
Don't know the definitions of the features
Don't know which are categorical!
Huge amount of data
Training data alone is 488 MB
105,471 cases
Takes a while just to read the data
-
Is big data really big?
Sample from a homogeneous population?
Populations change over time
Household energy survey in the US
Longitudinal data
Repeated measures
Instrumentation allows data on a finer scale
Components replace averages, revealing noise
Time series
Financial and economic data
The data from Imperial are really a time series, but that's hidden
Assumption of stationarity
Lurking issue of dependence
-
Once Data Are Identified...
Deal with the Vs of big data
Volume, velocity, variety, validity, ...
Data preparation often requires programming
Building variables from accounting and billing data
Example: the structure of a mortgage loan recorded in the data
Build new columns based on the presence of certain strings in this column
Scripting tools
Perl, python, ruby, R, unix utilities (Makefiles)
Find a tool that you like to use and master it
-
Data from ISL
R: Use the str command to see the structure of a data set
-
Regression Models
Building block of all approaches
-
Preliminary Analysis
-
Getting Started
Plots remain important, even with millions of cases
Experience in modeling bankruptcies
Outliers in text analysis
Sampling
Don't have to plot every observation
We know a lot about sampling
Objective
Trade-off: pay now or pay later
Outliers, leverage points
Transformations
Combinations
-
Univariate Plots
Plot linking and brushing
Interactive tools add value to univariate plots
Things to look for
Scales, particularly when unfamiliar with the data
Age is not the age of residents; it's the proportion of homes built before 1940
Skewness, multimodality: square peg, round hole
Some methods automate conversion to symmetry
Outliers
Outliers matter, even in large data tables
Scope
At least inspect the variables of primary interest!
boston_from_R
Analyze > Distribution
-
Scatterplot Matrix
A visual correlation matrix
Mixes data types
Numerical/categorical
Visual tables
Embellishments
Coloring a subset
Subsampling
Interactive
Brushing
Plot linking
A weakness of R
Graph > Scatterplot Matrix; see Figure 3.6 of ISL
-
Rounded Data
Data have often been rounded
Are categories ratio-level data?
Plots can hide a great deal
Dithering (adding some random noise) helps
caravans
JMP graphs categorical data in a grid, with dithering
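The dithering idea can be sketched in a few lines. This is an illustrative Python sketch, not the JMP implementation; the jitter amount of 0.4 is an arbitrary assumption.

```python
import random

def jitter(values, amount=0.4, seed=0):
    """Add small uniform noise so overplotted rounded values separate."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

ratings = [1, 1, 1, 2, 2, 3]   # heavily rounded data: points overplot exactly
jittered = jitter(ratings)     # each value moved by at most 0.4
```

Plotting the jittered values against another variable spreads the tied points apart without changing the visible pattern.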
-
Summary of Exploratory Analysis
Data are key to any modeling
Take time to understand the nuances of your data
Simple, interpretable plots
Substantive insight + exploration = new features
Transformations
Substantive combinations (e.g., ratios)
Clustering
Plots remain important, even with many variables
Use samples to speed viewing (though you may miss outliers)
It would be slow to plot 25,000 points at zero for the Imperial data
-
Regression Methods
Quick review, emphasizing the link to DM
Regression provides an opportunity to study DM issues in a familiar context
-
Why Emphasize Regression?
Claim
Regression can match the predictive performance of black-box models...
Just need the right explanatory variables!
Opportunity for substantive insights
Regression is familiar
Recognize, then fix, problems
Shares problems with black boxes
Foundation for understanding complex, opaque models
Familiarity allows improvements
Several are given in Foster and Stine (2004)
-
Classical Regression
Two-part model
Mean of the response is linear in the explanatory variables: E(Y|X) = μ(y|x) = β0 + β1X1 + ... + βpXp
Unexplained, idealized iid random variation: Var(Y − μ(y|x)) = σ²
Discussion
No limits on the Xs: powers, logs, products
Model assumes you know which variables to use; just a question of estimating the βs
Error assumptions are reasonable if correctly specified; CLT: a sum of small, omitted contributions is normal
Interpretation becomes difficult as the model grows
-
Least Squares
Assuming we know that yi = β0 + β1xi1 + ... + βpxip + εi, with independent, equal-variance errors
Data: (n x 1) response vector Y; (n x p) matrix of explanatory variables X
Fitted values: ŷ = b0 + b1X1 + ... + bpXp
OLS criterion: minimize Σ(yi − ŷi)² over b, solved by b = (X′X)⁻¹X′Y
Normal equations: Σ(yi − ŷi)xi = 0
Goodness of fit: R² = SS(ŷ)/SS(y)
Standard errors: Var(b) = σ²(X′X)⁻¹, estimated by var(b) = s²(X′X)⁻¹; for one slope, var(b1) ≈ s²/(n var(X))
The equation is assumed KNOWN
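For a single predictor the normal equations reduce to closed-form averages. A minimal Python sketch (simulated data, not course code; the course itself works in R and JMP):

```python
import random

def ols_simple(x, y):
    """Closed-form least squares for one predictor: the normal equations
    reduce to b1 = Sxy/Sxx and b0 = ybar - b1*xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

rng = random.Random(1)
x = [i / 10 for i in range(100)]
y = [2 + 3 * xi + rng.gauss(0, 1) for xi in x]   # true b0 = 2, b1 = 3
b0, b1 = ols_simple(x, y)
```

With n = 100 and unit error variance, the estimates land close to the true coefficients, as the var(b1) ≈ s²/(n var(X)) formula above predicts.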
-
Example: Depreciation
How should BMW set the lease price for cars in its 3-series sedans?
Depends on the residual value at the end of the lease
BMW-prices
-
Example: Depreciation
Sample of 218 used BMW 3-series cars
Model includes mileage, age, and type of car, as well as an interaction of age and type
Effect-coded estimates: R² = 71%, s = $2,469
-
Careful Interpretation
What do these estimates tell us?
Regression concerns the comparison of means under different conditions
A highly beefed-up two-sample testing procedure!
Association ≠ causation
Carefully read the ISL text's discussion of the advertising example
-
What's an interaction?
What are these terms in the model?
Dummy variables
Represent all but one category
Intercept represents the reference category, the left-out group
Effect coding
Intercept estimates an overall average
-
View in 3-D
Geometry
Two numerical variables define a plane in 3-D
-
Further Visualization
Interactive tools: profiling
Model profiling: viewing properties of the fitted least squares equation
Interactive view of the equation
-
Linear Regression ≠ Lines
What does the word linear mean? That the coefficients can be estimated as weighted averages of the response, not that the model fits lines.
Curvature is modeled by
Powers (such as squares of variables)
Interactions (products of variables)
Transformations (most often logs)
Examples
Interactions: Boston housing data
Transformations: diamond prices
-
Interactions
Boston housing data: classic economic analysis of the impact of pollution levels on housing values
Impact of social class and home sizes
Repeated from the previous scatterplot matrix example
Note the censoring of values, outliers, curvature
Marginal association consistent with partial slopes
boston_from_R
-
Interactions
The initial linear fit ignores the interaction between percentage Lower Status and Rooms.
Adding the interaction term to the model reveals a strong effect
Interpretation from the coefficients?
From the interactive profile view?
How'd you know to do that?
-
Response Surface
Adding squares of the features gives a response surface
Common in industrial design to locate an optimal mixture
The surface plot shows that the model fit is not linear in the initial features
Caution: illustration only!
The model has several highly leveraged outliers and omits relevant Xs
-
Transformations
Regression relies on appropriate scaling
Association may be highly nonlinear
Logs (i.e., percentage change) are often useful
Most stat tools won't know to do this!
Prices of 7,568 diamonds
log price ≈ 8.6 + 2 log(carat)
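The log-log fit can be sketched as follows; this is simulated data built to follow the slide's relation, not the 7,568-diamond data set itself.

```python
import math
import random

def slope_of(xs, ys):
    """Least squares slope of ys on xs."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((v - xbar) ** 2 for v in xs)
    return sum((u - xbar) * (v - ybar) for u, v in zip(xs, ys)) / sxx

rng = random.Random(0)
carats = [rng.uniform(0.2, 2.0) for _ in range(500)]
# prices simulated from the slide's relation log price = 8.6 + 2 log carat
prices = [math.exp(8.6 + 2.0 * math.log(c) + rng.gauss(0, 0.1)) for c in carats]
slope = slope_of([math.log(c) for c in carats], [math.log(p) for p in prices])
```

Regressing log price on log carat recovers the elasticity of about 2; a fit on the raw scale would miss the curvature.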
-
Inference
Classical inference relies on the validity of the assumed model
Systematic approach
Test the overall fit of the model first
If you reject H0, then proceed to individual effects
Why do this preliminary test?
BMW-prices
-
Collinearity and F
Highly collinear features
Significant separately, but not individually when entered together
BMW-prices
-
Inference
Testing components of the fitted model
Standard error: variation of the estimate from sample to sample
t-ratio: count of standard errors separating the estimate from 0
p-value: smallest level at which we can reject the null hypothesis; assumes a normal sampling distribution for the slopes
Statistical significance ≠ substantive importance
-
Bootstrapping
The standard error is key to inference
What are standard errors?
The bootstrap is an alternative method for obtaining standard errors and confidence intervals
Estimates the standard error by simulating the sampling procedure
Simulates by sampling with replacement from the observed distribution of the data
Implementation
R: bootstrap package; also easy to do yourself
JMP: basic tools in JMP Pro
-
Bootstrap Sampling
Standard error
Standard deviation of a statistic over repeated independent samples from the population
Bootstrap standard errors
Simulate the standard error
Draw B samples from the observed sample itself, sampling with replacement
The collection of repeated statistics estimates the sampling distribution
[figure: population → sample → B bootstrap samples → sampling distribution → SE]
R script; see Figure 5.11, p. 190
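The scheme above is easy to do yourself, as the slides note. A minimal Python sketch (the course uses R; sample sizes and B here are arbitrary assumptions):

```python
import random
import statistics

def bootstrap_se(sample, stat=statistics.mean, B=1000, seed=0):
    """Estimate the standard error of `stat` by resampling the observed
    sample with replacement B times."""
    rng = random.Random(seed)
    n = len(sample)
    reps = [stat([rng.choice(sample) for _ in range(n)]) for _ in range(B)]
    return statistics.stdev(reps)

rng = random.Random(2)
sample = [rng.gauss(10, 3) for _ in range(200)]
se_boot = bootstrap_se(sample)                              # resampling estimate
se_formula = statistics.stdev(sample) / len(sample) ** 0.5  # textbook s/sqrt(n)
```

For the mean, the bootstrap SE closely matches the s/√n formula; its value is that the same recipe works for statistics with no simple formula.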
-
Bootstrap Example
Bootstrap regression in JMP
Illustrative model with Age and Mileage
Various conventions for representing categorical variables
Compare the results of least squares to those found by resampling
The bootstrap finds modest collinearity, a normal sampling distribution, and higher standard errors
B = 500
Least squares
-
Outliers
Relevant even when modeling millions of cases
The CLT is not a guarantee
Most of the money often remains in the outliers; hiding outliers conceals risks, variation, profits
Complex models often result in sparse data
Sparse = most values are 0
e.g., an interaction between two rare properties
What's the p-value for the regression in this figure?
n = 10,000; at x = 0, ten 1s; at x = 1, one 1
What would bootstrapping do?
-
Sandwich Estimator
Conventional standard errors are not robust
Assume the model is true; test adding Xp+1 to the model
The math produces an expression for the variance
Sandwich estimators
Robustness of validity, even under dependence
Seldom implemented in software, particularly stepwise
Heteroscedasticity: var(b) = (X′X)⁻¹X′E(εε′)X(X′X)⁻¹ = (X′X)⁻¹X′D²X(X′X)⁻¹, with D² diagonal
Dependence: var(b) = (X′X)⁻¹X′E(εε′)X(X′X)⁻¹ = (X′X)⁻¹X′BX(X′X)⁻¹, with B block diagonal
Classical case: Var(b) = (X′X)⁻¹X′E(εε′)X(X′X)⁻¹ = σ²(X′X)⁻¹(X′X)(X′X)⁻¹ = σ²(X′X)⁻¹, estimated by s²(X′X)⁻¹
aka the White estimator
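For simple regression with an intercept, the sandwich reduces to a weighted sum of squared residuals. A Python sketch under an assumed heteroscedastic design (the 0.1 + 0.3|x − 5| error scale is invented for illustration):

```python
import random

def slope_ses(x, y):
    """Classical and HC0 (White sandwich) standard errors for the slope
    in a simple regression with intercept."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
    b0 = sum(y) / n - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - 2)
    se_classic = (s2 / sxx) ** 0.5          # assumes equal error variance
    # sandwich: slope element of (X'X)^-1 X' diag(e^2) X (X'X)^-1
    se_robust = (sum((xi - xbar) ** 2 * e * e
                     for xi, e in zip(x, resid)) / sxx ** 2) ** 0.5
    return se_classic, se_robust

rng = random.Random(4)
x = [rng.uniform(0, 10) for _ in range(500)]
# error sd grows away from the center of x, so the classical SE is too small
y = [1 + 2 * xi + rng.gauss(0, 0.1 + 0.3 * abs(xi - 5)) for xi in x]
se_c, se_r = slope_ses(x, y)
```

When the error variance varies with x, the robust SE exceeds the classical one; under homoscedasticity the two agree.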
-
Better Standard Errors
Heteroscedastic errors
Estimating the standard error with an outlier: the sandwich estimator, allowing heteroscedastic error variances, gives a t-stat ≈ 2, not 10
Dependent errors
An even more important need for accurate SEs
Netflix contest: Bonferroni (or hard thresholding) overfits due to dependence in the responses
Credit modeling with 100,000s of cases: everything is significant unless you incorporate the longitudinal dependence
R Script
-
Bias-Variance Tradeoff
The more we allow a model to adapt to data, the more easily it adapts to random noise.
-
Statistics = Averaging
Averages estimate parameters and predict; it's just a question of which data to average.
Underlying rationale
Find μ to minimize E(Y − μ)² given known features X
The best predictor is always the conditional mean, μ = E(Y|X)
Ideal data
Replications: a large count of cases that share each relevant set of explanatory characteristics
Unlikely to find so many so similar
-
Example
Problem: estimate an unknown function
Y = f(x) + error, with f(x) = sin(2πx)
Data spread over the x-axis from 0 to 1
Estimator
Average the values of the response for similar values of x
Divide the x-axis from 0 to 1 into d bins: μ̂j = ave(Y | (j−1)/d ≤ x ≤ j/d), j = 1, 2, ..., d
bias_variance.R
[figure: three panels of Response vs Explanatory Variable, n = 100, with binned fits for d = 5, 10, 20]
What's the best choice for d, the number of bins?
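The binned-average estimator can be sketched in a few lines; this Python version mirrors the slide's setup (f(x) = sin(2πx), noise sd 0.25, n = 100), though the original demonstration is in bias_variance.R.

```python
import math
import random

def bin_estimate(x, y, d):
    """Average the responses within each of d equal-width bins on [0, 1]."""
    sums, counts = [0.0] * d, [0] * d
    for xi, yi in zip(x, y):
        j = min(int(xi * d), d - 1)   # bin index for xi
        sums[j] += yi
        counts[j] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

rng = random.Random(3)
x = [rng.random() for _ in range(100)]
y = [math.sin(2 * math.pi * xi) + rng.gauss(0, 0.25) for xi in x]
means5 = bin_estimate(x, y, 5)   # d = 5 bin means tracking sin(2*pi*x)
```

With d = 5 the bin means follow the sine's rise and fall; increasing d tracks the curve more closely but leaves fewer cases in each average.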
-
Trade-Offs
Bias versus variance
A large number of bins is able to track the changing values of f(x): lower bias
A large number of bins implies few data values within a bin: higher variance
[figure: squared error vs number of bins, showing Bias², Variance, and Total; sigma = 0.25, n = 100; larger d gives smaller bias but larger variance]
-
Looks Easy?
Problem
You don't get to see that plot!
Residual SS
Residual variation gets smaller and smaller as the model complexity grows (i.e., as bins are added)
[figure: square root of the average squared residual vs number of bins, declining steadily; sigma = 0.25, n = 100]
-
Harder Still
Optimal number of bins (model complexity)
Depends on the amount of random noise relative to the curvature of the underlying, unknown f(x)
Implication
With less noise, we need less averaging
But we don't know how much noise there is!
Look back at the prior slide to see that we don't know the level of noise
[figure: squared error vs number of bins (Bias², Variance, Total) for sigma = 0.15, 0.25, and 0.35 with n = 100; the optimal number of bins shifts with the noise level]
-
Bias-Variance Review
Model
Data are independent observations of Y = f(x) + ε, ε ~ N(0, σ²)
Choice
Rougher: average nearby cases, where |xi − xj| is small: small bias, large variance
Smoother: average cases where |xi − xj| is large: large bias, small variance
Best choice
Depends on the underlying mean function f(x)
The smoother the mean function (small curvature), the more benefit from averaging
-
Animated Example
The true model is very simple, Y = X + ε with ε ~ N(0, 2²)
A diagonal line
Smoothing spline (ISL 7.5)
A smooth curve that estimates Ave(Y|X)
Controllable degree of smoothness
The JMP version is animated
Which fits the observed data best?
Which predicts new data best?
-
Example Suggests Approach
The complex model has the best fit to the observed data: Line R² = 0.889; Smooths R² = 0.918 and 0.968
Average squared errors from the true mean:
The line predicts held-back data best
-
Averaging in DM: KNN
Ideal predictor, ideal data
The best predictor is always the conditional mean μ = E(Y|X)
Average cases with the same explanatory characteristics
Practical version
Identify a subset of, say, d relevant characteristics
Average the k cases that are most similar
KNN: k-nearest neighbors
ŷ = average of the k nearest values (closest on X)
Need to pick k
Small k: adapts to local behavior
Large k: smoother, with smaller variance (recall Var(ȳ) = σ²/n)
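A minimal KNN classifier, illustrating the idea above in Python (the course examples use R/JMP; the toy groups and k = 3 are assumptions):

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training cases closest to the query point."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# two well-separated groups; note that distances depend on the variables' scales
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
pred = knn_predict(train_x, train_y, (0.5, 0.5), k=3)
```

Because the vote is driven by Euclidean distance, rescaling one coordinate changes which neighbors are "nearest", which is why scales matter so much for KNN.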
-
KNN Classifier
The majority among your nearest k neighbors determines your estimated group
Figure 2.14, p. 40
k = 3
Note the importance of scales
-
Bias or Variance?
Few neighbors (small k): flexible boundary, low bias, high variance
Many neighbors (large k): smooth boundary, high bias, low variance
Figure 2.16, p. 41; the dashed boundary uses k = 10
n = 200 cases, 100 of each color
-
Best Choice?
Simulated data, so we know the right answer
Figure 2.17, p. 42: fewer neighbors → more flexible
See the example in the lab
How far do you need to look to find 10 neighbors?
-
Connections
Data mining tools resemble KNN and smoothing
Adjustable fit, with one or more tuning parameters
Flexible, adaptive computational algorithms
Consequence
Easy to fit random variation
An artificially good fit that predicts poorly
Traditional statistical summaries are inappropriate
Emergent solution
Use reserved, held-back data to judge the model
The essence of cross-validation
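The held-back-data idea can be sketched with the earlier binned-average estimator; this Python setup (seed, sample sizes, and the d = 5 vs d = 50 comparison) is an illustrative assumption, not course code.

```python
import math
import random

def bin_model(x, y, d):
    """Fit the binned-average estimator and return it as a prediction function."""
    sums, counts = [0.0] * d, [0] * d
    for xi, yi in zip(x, y):
        j = min(int(xi * d), d - 1)
        sums[j] += yi
        counts[j] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return lambda xi: means[min(int(xi * d), d - 1)]

def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(x)

rng = random.Random(5)
def simulate(n):
    xs = [rng.random() for _ in range(n)]
    return xs, [math.sin(2 * math.pi * xi) + rng.gauss(0, 0.25) for xi in xs]

train_x, train_y = simulate(100)   # data used to fit
test_x, test_y = simulate(100)     # reserved, held-back data
results = {}
for d in (5, 50):
    fit = bin_model(train_x, train_y, d)
    results[d] = (mse(fit, train_x, train_y), mse(fit, test_x, test_y))
# in-sample error always falls as d grows; held-back error typically does not
```

The complex fit (d = 50) looks better in-sample but its error on the reserved half is much larger than its training error, which is exactly the gap cross-validation is designed to expose.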
-
Missing Data
-
Missing Data
Always present
In the medical example, 170 out of 1,200 cases are complete
Often informative
In the bankruptcy model, half of the predictors indicate the presence of missing data
Is data ever missing at random?
Handle as part of the modeling process?
Offer a simple patch that requires few assumptions
Main idea
Done as a data preparation step
Add an indicator column for missing values
Fill the missing value
JMP Pro does this in some platforms
-
Handle Missing by Adding Variables
Add another variable
Add an indicator column for missing values
Fill the missing values with the average of those seen
A simple approach with fewer assumptions
Expands the domain of the feature search
Allows missing cases to behave differently
Conservative evaluation of the variable
Part of the modeling process
Distinguishes missing subsets only if predictive
Missing in a categorical variable: not a problem
Missing values define another category
Only for Xs, never Y
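The fill-and-flag recipe is a one-pass data preparation step. A Python sketch (illustrative, not the missing_data.R or JMP Pro implementation; missing values are represented as None):

```python
import statistics

def fill_with_indicator(column):
    """Fill missing (None) entries with the mean of the observed values and
    add a 0/1 indicator of missingness."""
    observed = [v for v in column if v is not None]
    fill = statistics.mean(observed)
    filled = [fill if v is None else v for v in column]
    indicator = [1 if v is None else 0 for v in column]
    return filled, indicator

x = [2.0, None, 4.0, 6.0, None]
filled, miss = fill_with_indicator(x)
# filled -> [2.0, 4.0, 4.0, 6.0, 4.0]; miss -> [0, 1, 0, 0, 1]
```

Both columns then enter the model: the indicator lets the missing cases shift away from the filled-in mean only if that shift is actually predictive.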
-
Example
Data frame with missing values
Filled in data with added indicator columns
missing_data.R
-
Example of Procedure
Simple regression, missing at random
Conservative: unbiased estimate, inflated SE
n = 100, β0 = 0, β1 = 3; 30% missing at random
[figure: scatterplots of the complete and filled-in data with fitted lines]
Complete: b0 = −0.25 (SE 1.0), b1 = 3.05 (SE 0.17)
Filled in: b0 = −1.5 (SE 1.4), b1 = 3.01 (SE 0.27)
-
Example of Procedure
Simple regression, not missing at random
Conservative: unbiased estimate, inflated SE
n = 100, β0 = 0, β1 = 3; the 30% missing follow a steeper line
Filled in: b0 = −0.02 (SE 2.6), b1 = 2.82 (SE 0.44)
Requires a robust variance estimate
[figure: scatterplots of the data, with the missing cases following a steeper line]
-
Imperial Example
Variable 331 is interesting, but has missing values
Look at only those who default
Missing for 11% among these
Regression results
Note the results are conservative (smaller t)
Missingness is informative
Retain all cases for subsequent analysis
The variables are anonymous
-
Calibration
A model should be right on average.