Thinking Robustly

Upload: ethelbert-armstrong

Post on 18-Jan-2018

TRANSCRIPT

Page 1: Thinking Robustly. Sampling Distribution In order to examine the properties of a statistic we often want to take repeated samples from some population

Thinking Robustly

Page 2: Sampling Distribution

In order to examine the properties of a statistic, we often want to take repeated samples from some population of data and calculate the relevant statistic on each sample.

We can then look at the distribution of the statistic across these samples and ask a variety of questions about it.
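The repeated-sampling idea can be sketched in a few lines of Python. This is only an illustration: the population (10,000 normal draws with mean 50 and sd 10), the sample size, the number of repetitions, and the helper name `sampling_distribution` are all assumptions made for the example, not part of the slides.

```python
import random
import statistics

def sampling_distribution(population, statistic, n, reps, seed=0):
    """Draw `reps` samples of size `n` and return the statistic for each one."""
    rng = random.Random(seed)
    return [statistic(rng.sample(population, n)) for _ in range(reps)]

# Hypothetical population: 10,000 draws from a normal with mean 50 and sd 10.
pop_rng = random.Random(42)
population = [pop_rng.gauss(50, 10) for _ in range(10_000)]

# Same seed for both calls, so both statistics are computed on the same samples.
means = sampling_distribution(population, statistics.mean, n=25, reps=2000)
medians = sampling_distribution(population, statistics.median, n=25, reps=2000)

print("population mean:        ", round(statistics.mean(population), 2))
print("mean of sample means:   ", round(statistics.mean(means), 2))
print("sd of sample means:     ", round(statistics.stdev(means), 2))
print("sd of sample medians:   ", round(statistics.stdev(medians), 2))
```

The spread of `means` and `medians` across samples is what the later slides call the standard error, so the same lists can be reused to ask about bias and efficiency.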

Page 3: Properties of a Statistic

Sufficiency
A sufficient statistic is one that makes use of all of the information in the sample to estimate its corresponding parameter. For example, this property makes the mean more attractive as a measure of central tendency compared to the mode or median.

Unbiasedness
A statistic is said to be an unbiased estimator if its expected value (e.g., for the sample mean, the mean of a large number of sample means) is equal to the population parameter it is estimating. As one can see using the resampling procedure, the mean can be shown to be an unbiased estimator.

Page 4: Properties of a Statistic

Efficiency
The efficiency of a statistic is reflected in the variance that is observed when one examines the statistic over independently chosen samples (i.e., its standard error). The smaller the variance, the more efficient the statistic is said to be.

Resistance
The resistance of an estimator refers to the degree to which the estimate is affected by extreme values, i.e. outliers. For a resistant estimator, small changes in the data result in only small changes in the estimate.

Finite-sample breakdown point
A measure of resistance to contamination: the smallest proportion of observations that, when altered sufficiently, can render the statistic arbitrarily large or small.
Median: about 1/2 (roughly n/2 of the n observations)
Trimmed mean: whatever the trimming amount is
Mean: 1/n
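The breakdown points listed above can be checked by hand on a small data set. The sketch below replaces the k largest of ten values with a wildly large number and watches when each statistic blows up; the data, the contaminating value, and the helper names are assumptions chosen for illustration.

```python
import math
import statistics

def trimmed_mean(values, prop=0.2):
    """Mean after dropping the proportion `prop` of values from each tail."""
    xs = sorted(values)
    g = math.floor(prop * len(xs) + 1e-9)  # number trimmed per tail
    return statistics.mean(xs[g:len(xs) - g])

def contaminate(values, k, bad=1e6):
    """Replace the k largest values with an arbitrarily large value."""
    xs = sorted(values)
    return xs[:len(xs) - k] + [bad] * k

clean = list(range(1, 11))  # hypothetical data: 1, 2, ..., 10

for k in range(6):
    data = contaminate(clean, k)
    print(k, statistics.mean(data), trimmed_mean(data), statistics.median(data))
```

With n = 10, the mean is ruined by a single altered value (1/n), the 20% trimmed mean survives two altered values but not three, and the median survives four but not five (n/2).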

Page 5: Problematic Data

Typical problems include violations of normality and homoscedasticity, data with outliers, etc.

Mean-based methods can sometimes be ‘robust’, but only with respect to type I error, and even that does not hold in many situations: the type I error rate may increase or even drastically decrease.

More serious are the bias, inefficiency, and dramatic decrease in power (i.e. increase in type II error), leading to completely erroneous inferential conclusions and understatement of effects.

Page 6: Measures of Central Tendency

What we want:
A statistic whose standard error will not be grossly affected by small departures from normality
Power comparable to that of the mean and sd when dealing with a normal population
A value that is fairly stable when dealing with non-normal distributions

Two classes to speak of:
Trimmed mean
M-estimators

Page 7: Trimmed mean

You are already familiar with this in terms of the median, in which essentially all but the middle value is trimmed. But now we want to retain as much of the data as possible for best performance, while trimming enough to ensure resistance to outliers.

How much to trim?
About 20%, and that means from both sides.
Example: 15 values. .2 * 15 = 3, so remove the 3 largest and 3 smallest.

Advantages
In non-normal situations it will perform better than the mean
We already know it will be resistant to outliers
It will have a reduced standard error as well
Even under normal situations it is only slightly less efficient than the mean

Winsorized means
Instead of trimming, change all values past the upper and lower ‘trimming’ points to the nearest value that is retained.
X = 1 2 3 4 5 6 becomes
X = 2 2 3 4 5 5
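Both operations are easy to sketch in Python. The function names and the 15-value data set are made up for illustration; the trimming amount of .2 per tail and the X = 1 2 3 4 5 6 example come from the slide.

```python
import math
import statistics

def tail_count(n, prop):
    """Number of values affected in each tail for a trimming proportion."""
    return math.floor(prop * n + 1e-9)

def trim(values, prop=0.2):
    """Drop the proportion `prop` of values from each tail."""
    xs = sorted(values)
    g = tail_count(len(xs), prop)
    return xs[g:len(xs) - g]

def winsorize(values, prop=0.2):
    """Pull values beyond the trimming points in to the nearest retained value."""
    xs = sorted(values)
    g = tail_count(len(xs), prop)
    low, high = xs[g], xs[len(xs) - g - 1]
    return [min(max(x, low), high) for x in xs]

# Hypothetical sample of 15 values with one wild score.
data = list(range(1, 15)) + [100]

print("mean:           ", statistics.mean(data))
print("trimmed values: ", trim(data))             # 3 dropped from each side
print("trimmed mean:   ", statistics.mean(trim(data)))
print("winsorized X:   ", winsorize([1, 2, 3, 4, 5, 6]))
```

Sorting inside `winsorize` is fine for a mean or variance, but it would break the pairing between X and Y if the result were used for a correlation.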

Page 8: M-estimators

M-estimators are another robust measure of location. They involve the notion of a ‘loss function’.

Examples:
If we want to minimize squared errors from a particular value, the resulting measure of central tendency will be the mean
If we want to minimize absolute errors, the result will be the median

M-estimators are more mathematically complex, but we can get the gist: less weight is given to values that are further away from the ‘center’. Think of a robust z-statistic calculated for each value, with any value beyond some cutoff K downweighted or possibly ignored.

Different M-estimators give different weights to deviating values.
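To make the downweighting idea concrete, here is a rough sketch of one particular M-estimator of location, using Huber-type weights and iterative reweighting. The slides do not name a specific estimator, so the choice of Huber weights, the cutoff K = 1.28, the use of MAD/.6745 as the robust scale, and the function names are all assumptions for illustration.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])

def huber_location(values, k=1.28, tol=1e-10, max_iter=200):
    """Iteratively reweighted mean with Huber weights (assumed cutoff k)."""
    est = statistics.median(values)
    scale = mad(values) / 0.6745  # robust stand-in for the sd
    if scale == 0:
        return est
    for _ in range(max_iter):
        weights = []
        for x in values:
            z = abs(x - est) / scale           # robust z-score
            weights.append(1.0 if z <= k else k / z)  # downweight beyond k
        new_est = sum(w * x for w, x in zip(weights, values)) / sum(weights)
        if abs(new_est - est) < tol:
            return new_est
        est = new_est
    return est

# Hypothetical data with one extreme value.
data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
print("mean:  ", statistics.mean(data))
print("median:", statistics.median(data))
print("Huber: ", round(huber_location(data), 3))
```

Values within K robust standard deviations of the current estimate get full weight, and values beyond it get weight K/|z|, so the extreme score still counts but only a little.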

Page 9: M-estimators

Example of the weighting compared to the mean
With robust estimates of distance, weight accordingly

Page 10: Measures of Variability

We could also calculate a trimmed or Winsorized variance/standard deviation. In fact, the Winsorized version typically performs better as a variance measure.

A common robust measure of deviation from center is the median absolute deviation from the median (MAD).

For a normal distribution MAD ≈ .6745σ, so MAD/.6745 estimates the standard deviation. The MAD is far more resistant than s (its breakdown point is about 1/2) and can be more efficient when outliers are present. It also allows for a much better means of detecting a univariate outlier than the usual 2s standard. Flag X as an outlier if
|X − median| > 2(MAD/.6745)
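A short sketch comparing the MAD-based rule with the usual 2s rule. The data set is hypothetical and chosen so that two large values inflate s enough to hide one of them; the cutoff of 2 follows the slide.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])

def mad_outliers(values, cutoff=2.0):
    """Flag values with |X - median| > cutoff * (MAD / .6745)."""
    med = statistics.median(values)
    scale = mad(values) / 0.6745
    return [x for x in values if abs(x - med) > cutoff * scale]

def sd_outliers(values, cutoff=2.0):
    """Flag values more than `cutoff` standard deviations from the mean."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [x for x in values if abs(x - m) > cutoff * s]

data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 60]  # hypothetical scores

print("MAD rule flags:", mad_outliers(data))
print("2s rule flags: ", sd_outliers(data))
```

Because the outliers themselves inflate the mean and s, the 2s rule misses the value 50 here, while the median and MAD are barely moved by the two extreme scores.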

Page 11: Inferential use of robust statistics

In general, the approach to hypothesis testing and interval estimation will be the same using robust statistics as it has been with regular ones.

Of particular concern will be estimating the standard error and the relative merits of the robust approach.

Page 12: The Trimmed Mean

Consider the typical t-statistic for the one-sample case:

$$t = \frac{\bar{X} - \mu}{s_{\bar{X}}}$$

This will hold using a trimmed mean as well, except we will be using the remaining values after trimming and our inference will regard the population trimmed mean instead of the population mean:

$$t_t = \frac{\bar{X}_t - \mu_t}{s_{\bar{X}_t}}$$

Page 13: Robust example: Using trimmed means for an independent samples t-test

Calculate the variance for the group one trimmed mean:

$$d_1 = \frac{(n_1 - 1)\, s_{w1}^2}{h_1 (h_1 - 1)}$$

Here h refers to the n remaining after trimming, and the variance $s_{w}^2$ is Winsorized.

Do the same for group 2, then calculate the t statistic as follows:

$$t = \frac{(\bar{X}_{t1} - \bar{X}_{t2}) - (\mu_{t1} - \mu_{t2})}{\sqrt{d_1 + d_2}}$$

*Note that this formulation works for unequal sample sizes also
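The calculation can be sketched as follows, under the null hypothesis that the population trimmed means are equal. The function names and the two groups are hypothetical, and 20% trimming per tail is assumed; no p-value is computed.

```python
import math
import statistics

def tail_count(n, prop):
    """Number of values trimmed (or Winsorized) in each tail."""
    return math.floor(prop * n + 1e-9)

def trimmed_mean(values, prop=0.2):
    xs = sorted(values)
    g = tail_count(len(xs), prop)
    return statistics.mean(xs[g:len(xs) - g])

def winsorized_variance(values, prop=0.2):
    xs = sorted(values)
    g = tail_count(len(xs), prop)
    low, high = xs[g], xs[len(xs) - g - 1]
    return statistics.variance([min(max(x, low), high) for x in xs])

def squared_se(values, prop=0.2):
    """d = (n - 1) * s_w^2 / (h * (h - 1)), with h the n left after trimming."""
    n = len(values)
    h = n - 2 * tail_count(n, prop)
    return (n - 1) * winsorized_variance(values, prop) / (h * (h - 1))

def yuen_t(x, y, prop=0.2):
    """t statistic for the difference between two trimmed means."""
    diff = trimmed_mean(x, prop) - trimmed_mean(y, prop)
    return diff / math.sqrt(squared_se(x, prop) + squared_se(y, prop))

# Hypothetical groups; group_a contains one wild score.
group_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
group_b = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]

print("trimmed t:", round(yuen_t(group_a, group_b), 4))
```

With `prop=0` nothing is trimmed and the statistic reduces to the familiar unequal-variance t, which is one way to sanity-check the formula.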

Page 14: Robust example: Effect size

As Cohen’s d is a sample statistic, use the appropriate data for the trimmed case: calculate Cohen’s d from the values retained after trimming (i.e., using trimmed means and the Winsorized variance/sd).

Even with M-estimators the conceptual approach remains in effect as well:

$$d_{robust} = \frac{\hat{\mu}_{Mest1} - \hat{\mu}_{Mest2}}{s_{bimid}}$$

where the numerator is the difference between the two groups’ M-estimates of location and the denominator is the scale estimate labeled ‘bimid’ on the slide.

Page 15: Summary

Given the issues regarding means, variances and inferences based on them, a robust approach is appropriate and preferred when dealing with outliers/non-normal data:
Increased power
More accurate assessment of group tendencies and differences
More accurate assessment of effect size

If we want the best estimates and best inference given not-so-normal situations, standard methods simply don’t work too well.

We now have the methods and computing capacity to take a more robust approach to data analysis, and should not be afraid to use them.

Page 16: J. W. Tukey (1979)

“… just which robust/resistant methods you use is not important – what is important is that you use some. It is perfectly proper to use both classical and robust/resistant methods routinely, and only worry when they differ enough to matter. But when they differ, you should think hard.”

None of you use robust stats, I can tell just by looking at you!

Page 17: Intro to Robust Correlation

Page 18: Effect of Outliers

Outliers can artificially and dramatically increase or decrease r.

Options
Compute r with and without outliers
Transform variables
Compute a robustified r
For example, recode outliers as having more conservative scores (winsorize)

Page 19: What else?

r is the starting point for any regression and related method. Both the slope and the magnitude of the residuals are reflective of r:
r = 0 implies slope = 0

As such, a lone r is really more of a starting point for understanding the relationship between two variables, but as it is that foundation, we’d like to have a good assessment of it.

Page 20: Robust Approaches to Correlation

Rank approaches
Winsorized
Percentage Bend

Page 21: Rank approaches: Spearman’s rho and Kendall’s tau

Spearman’s rho is calculated using the same formula as Pearson’s r, but with the variables in the form of ranks. Simply rank the data available:
X = 10 15 5 35 25 becomes
X = 2 3 1 5 4
Do this for X and Y and calculate r as normal.

Kendall’s tau is another rank-based approach, but the details of its calculation are different.

For theoretical reasons it may be preferable to Spearman’s, but both should be consistent for the most part and perform better than Pearson’s r when dealing with non-normal data.
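The "rank, then compute r as normal" recipe can be sketched directly. The X values are the ones on the slide; the Y values are hypothetical, and ties are given average ranks, which is a common convention the slide does not spell out.

```python
import math

def ranks(values):
    """Rank from smallest (1) to largest, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Pearson's r computed on the ranks of x and y."""
    return pearson(ranks(x), ranks(y))

x = [10, 15, 5, 35, 25]
y = [1, 4, 0.5, 1000, 9]   # hypothetical: increases with x, one huge value

print("ranks of X:", ranks(x))
print("Pearson r: ", round(pearson(x, y), 3))
print("Spearman:  ", round(spearman(x, y), 3))
```

Because Y increases whenever X does, the rank correlation is perfect even though the single huge Y value keeps Pearson’s r below 1.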

Page 22: Winsorized Correlation

As mentioned before, Winsorizing data involves changing some decided-upon percentage of the extreme scores (high and low) to the value of the most extreme score that is not Winsorized.

Winsorize both the X and Y values (without regard to each other) and compute Pearson’s r.

This has an advantage over rank-based approaches in that the nature of the scales of measurement remains unchanged.

For theoretical reasons (recall some of our earlier discussion regarding the standard error for trimmed means), a Winsorized correlation would be preferable to trimming, though trimming is preferable for group comparisons.
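A minimal sketch of the Winsorized correlation, assuming 20% Winsorizing in each tail. The key detail is that each variable is clamped in place, so the X and Y values of a case stay paired. The data and function names are hypothetical.

```python
import math

def winsorize_keep_order(values, prop=0.2):
    """Clamp extreme values to the nearest retained value, keeping case order."""
    xs = sorted(values)
    g = math.floor(prop * len(xs) + 1e-9)
    low, high = xs[g], xs[len(xs) - g - 1]
    return [min(max(v, low), high) for v in values]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def winsorized_correlation(x, y, prop=0.2):
    """Winsorize X and Y separately, then compute Pearson's r."""
    return pearson(winsorize_keep_order(x, prop), winsorize_keep_order(y, prop))

# Hypothetical data: y = 2x for nine cases, plus one wild y value.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 10, 12, 14, 16, 18, -40]

print("Pearson r:   ", round(pearson(x, y), 3))
print("Winsorized r:", round(winsorized_correlation(x, y), 3))
```

Here one bad case is enough to make Pearson’s r negative, while the Winsorized version recovers a clearly positive association.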

Page 23: Methods Related to M-estimators

The percentage bend correlation utilizes the median and a generalization of the MAD.

A criticism of the Winsorized correlation is that the amount of Winsorizing is fixed in advance rather than determined by the data, and the percentage bend correlation (r_pb) gets around that.

While the details can get a bit technical, you can get some sense of what is going on by relying on what you know regarding the robust approach in general.

With independent X and Y variables (i.e. r = 0) and under normality, the values of robust approaches to correlation will match the Pearson r.

With nonnormal data, the robust approaches described guard against outliers on the respective X and Y variables while Pearson’s r does not.

Page 24: Problem

While these alternative methods help us in some sense, an issue remains.

When dealing with correlation, we are not considering the variables in isolation. An outlier on one or the other variable might not be a bivariate outlier. Conversely, what might be a bivariate outlier may not contain values that are outliers for X or Y themselves.

Page 25: Global measures of association

Measures are available that take into account the bivariate nature of the situation:
Minimum Volume Ellipsoid Estimator (MVE)
Minimum Covariance Determinant Estimator (MCD)

Page 26: Minimum Volume Ellipsoid Estimator

Robust elliptic plot (relplot)
Relplots are like scatterplot boxplots for our data, where the inner ellipse contains half the values and anything outside the outer ellipse would be considered an outlier.

A strategy for robust estimation of correlation would be to find the ellipse with the smallest area that contains half the data points. Those points are then used to calculate the correlation. This is the MVE.

Page 27: Minimum Covariance Determinant Estimator

The MCD is another alternative we might use, and it involves the notion of a generalized variance, which is a measure of the overall variability among a cloud of points. The generalized variance is the determinant of the covariance matrix.

For the two-variable situation:

$$s_g^2 = s_x^2 s_y^2 (1 - r^2)$$

As we can see, as r is a measure of linear association, the more tightly the points are packed around a line, the larger r would be in magnitude, and subsequently the smaller the generalized variance would be.

The MCD picks that half of the data which produces the smallest generalized variance, and calculates r from that.
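For a handful of points, the MCD idea can be illustrated by brute force: try every subset of the chosen size, keep the one with the smallest generalized variance, and compute r from it. This exhaustive search is only workable for tiny data sets and is not how statistical software does it. The data, the function names, and the subset size h = 5 of 8 points (just over half) are assumptions for the example.

```python
import itertools
import math

def moments(points):
    """Sample variances of x and y and their covariance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx, syy, sxy

def generalized_variance(points):
    """Determinant of the 2 x 2 covariance matrix."""
    sxx, syy, sxy = moments(points)
    return sxx * syy - sxy ** 2

def correlation(points):
    sxx, syy, sxy = moments(points)
    return sxy / math.sqrt(sxx * syy)

def mcd_correlation(points, h):
    """r from the h-point subset with the smallest generalized variance."""
    best = min(itertools.combinations(points, h), key=generalized_variance)
    return correlation(best)

# Hypothetical data: six points near a line, two that break the pattern.
points = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 4.2),
          (5, 4.8), (6, 6.1), (7, 1.0), (8, 0.5)]

print("Pearson r on all points:", round(correlation(points), 3))
print("MCD-style r (h = 5):    ", round(mcd_correlation(points, h=5), 3))
```

The determinant computed by `generalized_variance` equals $s_x^2 s_y^2 (1 - r^2)$, which is why the tightly packed subset wins the search.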

Page 28: Global measures of association

Note that both the MVE and MCD can be extended to situations with more than two variables. We’d just be dealing with a larger matrix.

Example using the Robust library in S-Plus
OMG! Drop down menus even!

Page 29: Remaining issues: Curvature

The fact is that straight lines may not capture the true story. We may often fail to find noticeable relationships because our r, whichever “Pearsonesque” method we choose, is trying to specify a linear relationship.

There may still be a relationship, and a strong one, just a more complex one.

Page 30: Summary

Correlation, in terms of Pearson r, gives us a sense of the strength of a linear association between two variables.

One data point can render it a useless measure, as it is not robust to outliers.

Measures which are robust are available, and some take into account the bivariate nature of the data.

However, curvilinear relationships may exist, and we should examine the data to see if alternative explanations are viable.

Page 31: Intro to Robust Regression

Page 32: Violating assumptions

Usual situation
Slight problems may not result in much change in type I error
However, type II error will be a major concern with even modest violations
With multiple violations, type I error may also suffer
Additional assumptions will be made for multiple predictors

Page 33: Outliers

As outliers can greatly influence r, they will naturally influence any analysis using it.

Detecting and dealing with outliers is a part of the process of regression analysis.

One issue is distinguishing univariate vs. multivariate outliers. While a data point might be an outlier on a variable, it may not be as far as the model goes. Conversely, what might be an outlier for the model might not have its individual variable values noted as outliers.

Page 34: Robust Regression

A single unusual point can greatly distort the picture regarding the relationship among variables.

Heteroscedasticity, even in ‘normal’ situations, inflates the standard error of estimate and decreases our estimate of R².

Nonnormality can hamper our ability to come up with useful interval measures for slopes.

A couple of examples can give you an idea of the general approach.

Page 35: Least Median of Squares

Instead of minimizing the sum of the squared residuals, find the slope and intercept that minimize the median of the squared residuals.

While conceptually straightforward, it doesn’t seem to perform as well generally as other robust approaches.

Page 36: Least Trimmed Squares

The least trimmed squares approach involves trimming the smallest and largest residuals (i.e., the most extreme negative and positive ones). So if h is the number of values left after trimming and the ordered squared residuals are

$$r_{(1)}^2 \le r_{(2)}^2 \le r_{(3)}^2 \le \dots \le r_{(h)}^2$$

then the goal would be to minimize the sum of the squared residuals of the remaining data.

Note again that the recommended trimming amount is about .2.
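For very small data sets the least trimmed squares fit can be found exactly by brute force, because minimizing the sum of the h smallest squared residuals is the same as finding the h-point subset whose ordinary least squares fit has the smallest residual sum of squares. The sketch below does that; it is far too slow for real data, where software relies on faster approximate searches. The data, the function names, and the choice h = 6 of 10 points (.2 trimmed from each tail of the residuals) are assumptions for illustration.

```python
import itertools

def least_squares(points):
    """Ordinary least squares slope and intercept for (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

def residual_ss(points, slope, intercept):
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)

def lts_fit(points, h):
    """Exhaustive LTS: best least squares fit over all h-point subsets."""
    best = None
    for subset in itertools.combinations(points, h):
        if len({x for x, _ in subset}) < 2:
            continue  # slope undefined when all x values are equal
        slope, intercept = least_squares(subset)
        criterion = residual_ss(subset, slope, intercept)
        if best is None or criterion < best[0]:
            best = (criterion, slope, intercept)
    return best[1], best[2]

# Hypothetical data: y = 2x + 1 for eight cases, plus two bad cases.
points = [(x, 2 * x + 1) for x in range(1, 9)] + [(9, 0), (10, -5)]

print("LS  slope, intercept:", least_squares(points))
print("LTS slope, intercept:", lts_fit(points, h=6))
```

The two bad cases pull the least squares slope below zero, while the trimmed fit recovers the line followed by the bulk of the data.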

Page 37: Least Trimmed Absolute Value

Same approach, but rather than minimize the trimmed squared residuals, we minimize the sum of the absolute residuals remaining after trimming.

This may be preferable to LTS in heteroscedastic situations.

Page 38: Theil-Sen Estimator

For any pair of data points (cases) regarding a relationship between two variables, we can plot those 2 points, produce a line connecting them, and note its slope.
E.g. if we had 4 data points we could calculate 6 slopes:
X = 1, 2, 3, 4
Y = 5, 7, 11, 15

If each of those slopes is weighted by the squared difference in the X values for the appropriate points, the weighted average of all the slopes would be the least squares slope for the model.
E.g. create a line for the points (1,5) and (2,7):
Slope = 2
Weight by (1 − 2)² = 1

What if, instead of a weighted average, the median of those slopes is chosen as our model slope estimate?

That in essence is the Theil-Sen estimator.
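The four-point example above can be worked through in code. The function names are made up for the sketch, and only the slope is estimated, since the slide does not discuss the intercept.

```python
import itertools
import statistics

def pairwise_slopes(x, y):
    """(slope, weight) for every pair of points; weight = squared x difference."""
    pairs = []
    for (x1, y1), (x2, y2) in itertools.combinations(zip(x, y), 2):
        if x1 != x2:  # slope is undefined for tied x values
            pairs.append(((y2 - y1) / (x2 - x1), (x2 - x1) ** 2))
    return pairs

def weighted_average_slope(x, y):
    """Weighted average of the pairwise slopes (the least squares slope)."""
    pairs = pairwise_slopes(x, y)
    return sum(s * w for s, w in pairs) / sum(w for _, w in pairs)

def theil_sen_slope(x, y):
    """Median of the pairwise slopes."""
    return statistics.median([s for s, _ in pairwise_slopes(x, y)])

x = [1, 2, 3, 4]
y = [5, 7, 11, 15]

print("pairwise slopes:    ", [round(s, 3) for s, _ in pairwise_slopes(x, y)])
print("weighted average:   ", round(weighted_average_slope(x, y), 3))
print("median (Theil-Sen): ", round(theil_sen_slope(x, y), 3))
```

The six slopes are 2, 3, 10/3, 4, 4 and 4. Their weighted average is 68/20 = 3.4, matching the least squares slope, while their median is 11/3, or about 3.67.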

Page 39: Summary

In single predictor situations, alternatives are available that perform well in ideal situations, and much better than the LS approach in others (Theil-Sen in particular).

While we have kept to the single predictor, this will rarely be our research situation in using regression analysis.

These methods can also be generalized to the multiple predictor setting, but their breakdown point (i.e. resistance advantage) decreases as more predictors enter into the equation, save for the recently developed ‘deepest regression line’ method, which appears to maintain its breakdown point.

Page 40: Summary

Again we call on the Tukey suggestion: use the methods for comparison to what you are seeing with traditional approaches.

A general approach:
Check for linearity, perhaps using a smoother (the default for scatterplots using the R-commander menu system)
If ok there, then use an estimator with a breakdown point of about .2-.3, and compare with the LS output
If notable differences between LS and robust exist, figure out why and determine which is more appropriate
If assumptions are tenable and little difference between LS and robust exists, feel comfortable going with the LS output