
  • 7/29/2019 ShortCourse on Robust Statistics

    1/92

    A SHORT COURSE ON

    ROBUST STATISTICS

    David E. Tyler

    Rutgers

    The State University of New Jersey

    Web-Site

www.rci.rutgers.edu/~dtyler/ShortCourse.pdf


    References

Huber, P.J. (1981). Robust Statistics. Wiley, New York.

    Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. and Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.

    Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006). Robust Statistics: Theory and Methods. Wiley, New York.


    PART 1

    CONCEPTS AND BASIC METHODS


    MOTIVATION

    Data Set: X_1, X_2, . . . , X_n

    Parametric Model: F(x_1, . . . , x_n | θ)

    θ: Unknown parameter

    F: Known function

    e.g. X_1, X_2, . . . , X_n i.i.d. Normal(μ, σ²)

    Q: Is it realistic to believe we don't know (μ, σ²), but we know e.g. the shape of the tails of the distribution?

    A: The model is assumed to be approximately true, e.g. symmetric and unimodal (past experience).

    Q: Are statistical methods which are good under the model reasonably good if the model is only approximately true?

    ROBUST STATISTICS formally addresses this issue.


    CLASSIC EXAMPLE: MEAN vs. MEDIAN

    Symmetric distributions: θ = population mean = population median

    Sample mean: X̄ ≈ Normal(θ, σ²/n)

    Sample median: Median ≈ Normal(θ, 1/(4 f(θ)² n))

    At the normal: Median ≈ Normal(θ, π σ²/(2n))

    Asymptotic Relative Efficiency of Median to Mean:

    ARE(Median, X̄) = avar(X̄) / avar(Median) = 2/π = 0.6366
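The 2/π figure can be checked numerically. The following is a minimal Monte Carlo sketch (not course code; the sample size and replication count are arbitrary choices):

```python
# Monte Carlo check of ARE(Median, X̄) = 2/π at the normal model.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 20000
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))

var_mean = samples.mean(axis=1).var()          # ≈ σ²/n
var_median = np.median(samples, axis=1).var()  # ≈ π σ²/(2n)

are = var_mean / var_median                    # ≈ 2/π ≈ 0.6366
print(round(are, 2))
```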


    CAUCHY DISTRIBUTION

    X ~ Cauchy(μ, σ²)

    f(x; μ, σ) = (1/(πσ)) [1 + ((x − μ)/σ)²]⁻¹

    Mean: X̄ ~ Cauchy(μ, σ²)    Median ≈ Normal(μ, π²σ²/(4n))

    ARE(Median, X̄) = ∞, or ARE(X̄, Median) = 0

    For t on ν degrees of freedom (ν > 2):

    ARE(Median, X̄) = 4 Γ((ν + 1)/2)² / (π (ν − 2) Γ(ν/2)²)

    ν                 2     3      4      5
    ARE(Median, X̄)   ∞     1.621  1.125  0.960
    ARE(X̄, Median)   0     0.617  0.888  1.041
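The table entries follow from the gamma-function formula above. A sketch of the computation (the function name is ours):

```python
# ARE(Median, X̄) for the t distribution on ν > 2 degrees of freedom.
from math import gamma, pi

def are_median_vs_mean_t(nu):
    """ARE(Median, X̄) = 4 Γ((ν+1)/2)² / (π (ν−2) Γ(ν/2)²)."""
    return 4 * gamma((nu + 1) / 2) ** 2 / (pi * (nu - 2) * gamma(nu / 2) ** 2)

for nu in (3, 4, 5):
    print(nu, round(are_median_vs_mean_t(nu), 3))  # ≈ 1.621, 1.125, 0.96
```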


    MIXTURE OF NORMALS

    Theory of errors: the Central Limit Theorem gives plausibility to normality.

    X ~ Normal(μ, σ²) with probability 1 − ε
        Normal(μ, (3σ)²) with probability ε

    i.e. not all measurements are equally precise.

    X ~ (1 − ε) Normal(μ, σ²) + ε Normal(μ, (3σ)²)

    [Figure: contaminated normal density, ε = 0.10]

    Classic paper: Tukey (1960), A survey of sampling from contaminated distributions.

    For ε > 0.10, ARE(Median, X̄) > 1.

    The mean absolute deviation is more efficient than the sample standard deviation for ε > 0.01.
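A quick simulation sketch of the first claim (illustration only; ε = 0.20 and the sample sizes are our choices, well inside the region where the median should win):

```python
# Under 20% contamination by Normal(0, 3²), the median beats the mean.
import numpy as np

rng = np.random.default_rng(1)
eps, n, reps = 0.20, 100, 20000
contaminated = rng.random((reps, n)) < eps
scale = np.where(contaminated, 3.0, 1.0)
x = rng.normal(size=(reps, n)) * scale  # (1−ε)N(0,1) + εN(0,3²)

are = x.mean(axis=1).var() / np.median(x, axis=1).var()
print(are > 1.0)
```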


    PRINCETON ROBUSTNESS STUDIES
    Andrews, et al. (1972)

    Other estimates of location.

    α-trimmed mean: Trim a proportion α from both ends of the data set and then take the mean. (Throwing away data?)

    α-Winsorized mean: Replace a proportion α from both ends of the data set by the next closest observation and then take the mean.

    Example: 2, 4, 5, 10, 200

    Mean = 44.2    Median = 5

    20% trimmed mean = (4 + 5 + 10) / 3 = 6.33

    20% Winsorized mean = (4 + 4 + 5 + 10 + 10) / 5 = 6.6
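The worked example can be reproduced with SciPy, which provides both estimators:

```python
# Trimmed and Winsorized means of the example data set.
import numpy as np
from scipy.stats import trim_mean
from scipy.stats.mstats import winsorize

x = np.array([2, 4, 5, 10, 200])

print(x.mean())                         # 44.2
print(np.median(x))                     # 5.0
print(trim_mean(x, 0.2))                # (4+5+10)/3 ≈ 6.33
print(winsorize(x, limits=0.2).mean())  # (4+4+5+10+10)/5 = 6.6
```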


    Measuring the robustness of a statistic

    Relative efficiency over a range of distributional models.

    There exist estimates of location which are asymptotically most efficient for the center of any symmetric distribution (adaptive estimation, semi-parametrics). Robust?

    Influence function over a range of distributional models.

    Maximum bias function and the breakdown point.


    Measuring the effect of an outlier (not modeled)

    Good Data Set: x_1, . . . , x_{n−1}
    Statistic: T_{n−1} = T(x_1, . . . , x_{n−1})

    Contaminated Data Set: x_1, . . . , x_{n−1}, x
    Contaminated Value: T_n = T(x_1, . . . , x_{n−1}, x)

    THE SENSITIVITY CURVE (Tukey, 1970)

    SC_n(x) = n (T_n − T_{n−1}), i.e. T_n = T_{n−1} + (1/n) SC_n(x)

    THE INFLUENCE FUNCTION (Hampel, 1969, 1974)

    Population version of the sensitivity curve.
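The definition above is easy to evaluate directly. A sketch for a small "good" data set of our choosing: the mean's sensitivity curve is unbounded in x, the median's is bounded.

```python
# Sensitivity curves SC_n(x) = n(T_n − T_{n−1}) for the mean and median.
import numpy as np

good = np.array([2.0, 4.0, 5.0, 10.0])  # x_1, ..., x_{n−1}
n = len(good) + 1

def sc(stat, x):
    """Sensitivity curve of the statistic `stat` at the added point x."""
    return n * (stat(np.append(good, x)) - stat(good))

for x in (0.0, 100.0):
    print(sc(np.mean, x), sc(np.median, x))
```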


    THE INFLUENCE FUNCTION

    The statistic T_n = T(F_n) estimates T(F). Consider F and the ε-contaminated distribution

    F_ε = (1 − ε) F + ε δ_x

    where δ_x is the point-mass distribution at x.

    Compare functional values: T(F) vs. T(F_ε).

    Given qualitative robustness (continuity): T(F_ε) → T(F) as ε → 0.
    (e.g. the mode is not qualitatively robust)

    Influence Function (infinitesimal perturbation: Gâteaux derivative):

    IF(x; T, F) = lim_{ε↓0} [T(F_ε) − T(F)] / ε = ∂T(F_ε)/∂ε |_{ε=0}


    EXAMPLES

    Mean: T(F) = E_F[X].

    T(F_ε) = E_{F_ε}[X] = (1 − ε) E_F[X] + ε x = (1 − ε) T(F) + ε x

    IF(x; T, F) = lim_{ε↓0} [(1 − ε) T(F) + ε x − T(F)] / ε = x − T(F)

    Median: T(F) = F⁻¹(1/2)

    IF(x; T, F) = {2 f(T(F))}⁻¹ sign(x − T(F))
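The mean's influence function can be verified by a finite-ε version of the definition on an empirical F: contaminating by a point mass at x moves the mean by exactly ε(x − T(F)). A sketch (data and ε are our choices):

```python
# Finite-ε check of IF(x; mean, F) = x − T(F).
import numpy as np

x_data = np.array([2.0, 4.0, 5.0, 10.0])
TF = x_data.mean()  # T(F) for the empirical F

def mean_contaminated(x, eps):
    """T(F_ε) for F_ε = (1 − ε) F + ε δ_x, when T is the mean."""
    return (1 - eps) * TF + eps * x

eps, x = 1e-6, 100.0
if_approx = (mean_contaminated(x, eps) - TF) / eps
print(if_approx, x - TF)  # both equal x − T(F)
```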


    Plots of Influence Functions

    Gives insight into the behavior of a statistic.

    T_n ≈ T_{n−1} + (1/n) IF(x; T, F)

    [Figure: influence functions of the mean, the median, the α-trimmed mean, and the α-Winsorized mean (somewhat unexpected?)]


    Desirable robustness properties for the influence function

    SMALL gross error sensitivity:

    GES(T; F) = sup_x | IF(x; T, F) |

    GES < ∞ ⇔ B-robust (bias-robust)

    Asymptotic variance:

    Note: E_F[ IF(X; T, F) ] = 0

    AV(T; F) = E_F[ IF(X; T, F)² ]

    Under general conditions, e.g. Fréchet differentiability,

    √n (T(X_1, . . . , X_n) − T(F)) → Normal(0, AV(T; F))

    Trade-off at the normal model: smaller AV ↔ larger GES.

    SMOOTH (local shift sensitivity): protects e.g. against rounding error.

    REDESCENDING to 0.


    REDESCENDING INFLUENCE FUNCTION

    Example: Data set of male heights in cm: 180, 175, 192, . . ., 185, 2020, 190, . . .

    Redescender = automatic outlier detector


    CLASSES OF ESTIMATES

    L-statistics: linear combinations of order statistics.

    Let X_(1) ≤ . . . ≤ X_(n) represent the order statistics.

    T(X_1, . . . , X_n) = Σ_{i=1}^n a_{i,n} X_(i)

    where the a_{i,n} are constants.

    Examples:

    Mean: a_{i,n} = 1/n

    Median, n odd: a_{i,n} = 1 for i = (n+1)/2, and 0 for i ≠ (n+1)/2

    Median, n even: a_{i,n} = 1/2 for i = n/2, n/2 + 1, and 0 otherwise

    α-trimmed mean.

    α-Winsorized mean.

    A general form for the influence function exists. One can obtain any desirable monotonic shape, but not a redescending one. L-statistics do not readily generalize to other settings.
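The weight choices above can be sketched directly; the helper name and data set are ours, not the course's:

```python
# L-statistics: T = Σ a_{i,n} X_(i), applied to sorted data.
import numpy as np

def l_statistic(x, weights):
    """Apply L-statistic weights to the order statistics of x."""
    return np.sum(np.sort(x) * weights)

x = np.array([10.0, 2.0, 5.0, 4.0, 200.0])
n = len(x)

mean_w = np.full(n, 1.0 / n)           # mean
median_w = np.zeros(n)                 # median (n odd): weight 1 at (n+1)/2
median_w[(n - 1) // 2] = 1.0
trim_w = np.array([0, 1, 1, 1, 0]) / 3.0  # 20% trimmed mean

print(l_statistic(x, mean_w))    # 44.2
print(l_statistic(x, median_w))  # 5.0
print(l_statistic(x, trim_w))    # ≈ 6.33
```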


    M-ESTIMATES

    Huber (1964, 1967)

    Maximum likelihood type estimates under non-standard conditions

    One-parameter case: X_1, . . . , X_n i.i.d. f(x; θ).

    Maximum likelihood estimates:

    Likelihood function: L(θ | x_1, . . . , x_n) = Π_{i=1}^n f(x_i; θ)

    Minimize the negative log-likelihood:

    min_θ Σ_{i=1}^n ρ(x_i; θ) where ρ(x_i; θ) = −log f(x_i; θ).

    Solve the likelihood equations:

    Σ_{i=1}^n ψ(x_i; θ) = 0 where ψ(x_i; θ) = ∂ρ(x_i; θ)/∂θ


    DEFINITIONS OF M-ESTIMATES

    Objective function approach: θ̂ = arg min_θ Σ_{i=1}^n ρ(x_i; θ)

    M-estimating equation approach: Σ_{i=1}^n ψ(x_i; θ̂) = 0.

    Note: unique solution when ψ(x; θ) is strictly monotone in θ.

    Basic examples.

    Mean. MLE for Normal: f(x) = (2π)^{−1/2} e^{−(x−θ)²/2} for −∞ < x < ∞.

    ρ(x; θ) = (x − θ)²  or  ψ(x; θ) = x − θ

    Median. MLE for Double Exponential: f(x) = (1/2) e^{−|x−θ|} for −∞ < x < ∞.

    ρ(x; θ) = |x − θ|  or  ψ(x; θ) = sign(x − θ)

    ρ and ψ need not be related to any density or to each other.

    Estimates can be evaluated under various distributions.
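The two basic examples can be checked by minimizing Σρ numerically with a generic 1-d minimizer (a sketch; the data set and search bounds are our choices):

```python
# ρ(x; θ) = (x − θ)² gives the mean; ρ(x; θ) = |x − θ| gives the median.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([2.0, 4.0, 5.0, 10.0, 200.0])

theta_sq = minimize_scalar(lambda t: np.sum((x - t) ** 2),
                           bounds=(0, 300), method="bounded").x
theta_abs = minimize_scalar(lambda t: np.sum(np.abs(x - t)),
                            bounds=(0, 300), method="bounded").x

print(theta_sq)   # ≈ 44.2 (the sample mean)
print(theta_abs)  # ≈ 5.0  (the sample median)
```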


    M-ESTIMATES OF LOCATION

    A symmetric and translation equivariant M-estimate.

    Translation equivariance: X_i → X_i + a ⇒ T_n → T_n + a
    gives ρ(x; t) = ρ(x − t) and ψ(x; t) = ψ(x − t).

    Symmetry: X_i → −X_i ⇒ T_n → −T_n
    gives ρ(−r) = ρ(r) or ψ(−r) = −ψ(r).

    Alternative derivation: generalization of the MLE for the center of symmetry for a given family of symmetric distributions,

    f(x; θ) = g(|x − θ|)


    INFLUENCE FUNCTION OF M-ESTIMATES

    M-functional: T(F) is the solution to E_F[ ψ(X; T(F)) ] = 0.

    IF(x; T, F) = c(T, F) ψ(x; T(F))

    where c(T, F) = 1 / E_F[ −∂ψ(X; θ)/∂θ ] evaluated at θ = T(F).

    Note: E_F[ IF(X; T, F) ] = 0.

    One can decide what shape is desired for the influence function and then construct an appropriate M-estimate.

    [Figure: influence functions of the mean and the median]


    EXAMPLE

    Choose

    ψ(r) =  −c   for r ≤ −c
             r   for |r| < c
             c   for r ≥ c

    where c is a tuning constant.

    Huber's M-estimate: an adaptively trimmed mean,
    i.e. the proportion trimmed depends upon the data.
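In code, this ψ is just clipping. A minimal sketch (the default c = 1.345 is a commonly used tuning constant, not fixed by the course):

```python
# Huber's ψ: ψ(r) = max(−c, min(r, c)).
import numpy as np

def huber_psi(r, c=1.345):
    return np.clip(r, -c, c)

r = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(huber_psi(r))  # [-1.345 -1.  0.  1.  1.345]
```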


    DERIVATION OF THE INFLUENCE FUNCTION FOR M-ESTIMATES

    Sketch.

    Let T_ε = T(F_ε), and so

    0 = E_{F_ε}[ψ(X; T_ε)] = (1 − ε) E_F[ψ(X; T_ε)] + ε ψ(x; T_ε).

    Taking the derivative with respect to ε,

    0 = −E_F[ψ(X; T_ε)] + (1 − ε) ∂/∂ε E_F[ψ(X; T_ε)] + ψ(x; T_ε) + ε ∂/∂ε ψ(x; T_ε).

    Let ψ′(x; θ) = ∂ψ(x; θ)/∂θ. Using the chain rule,

    0 = −E_F[ψ(X; T_ε)] + (1 − ε) E_F[ψ′(X; T_ε)] ∂T_ε/∂ε + ψ(x; T_ε) + ε ψ′(x; T_ε) ∂T_ε/∂ε.

    Letting ε → 0 and using qualitative robustness, i.e. T(F_ε) → T(F), then gives (the term E_F[ψ(X; T(F))] vanishes by the definition of T(F))

    0 = E_F[ψ′(X; T(F))] IF(x; T, F) + ψ(x; T(F)) ⇒ RESULTS.


    ASYMPTOTIC NORMALITY OF M-ESTIMATES

    Sketch.

    Let θ = T(F). Using a Taylor series expansion of ψ(x; θ̂) about θ gives

    0 = Σ_{i=1}^n ψ(x_i; θ̂) = Σ_{i=1}^n ψ(x_i; θ) + (θ̂ − θ) Σ_{i=1}^n ψ′(x_i; θ) + . . . ,

    or

    0 = √n {(1/n) Σ_{i=1}^n ψ(x_i; θ)} + √n (θ̂ − θ) {(1/n) Σ_{i=1}^n ψ′(x_i; θ)} + O_p(1/√n).

    By the CLT, √n {(1/n) Σ_{i=1}^n ψ(x_i; θ)} →_d Z ~ Normal(0, E_F[ψ(X; θ)²]).

    By the WLLN, (1/n) Σ_{i=1}^n ψ′(x_i; θ) →_p E_F[ψ′(X; θ)].

    Thus, by Slutsky's theorem,

    √n (θ̂ − θ) →_d −Z / E_F[ψ′(X; θ)] ~ Normal(0, σ²),

    where

    σ² = E_F[ψ(X; θ)²] / E_F[ψ′(X; θ)]² = E_F[ IF(X; T, F)² ]

    NOTE: Proving Fréchet differentiability is not necessary.
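The variance formula σ² = E[ψ²]/E[ψ′]² can be evaluated in closed form for Huber's ψ at the standard normal. A sketch (c = 1.345 is a common tuning choice, not from the course; the resulting efficiency relative to the mean is about 95%):

```python
# σ² = E[ψ²]/E[ψ′]² at N(0,1) for Huber's ψ with c = 1.345.
from math import sqrt, pi, exp, erf

def phi(x):  # standard normal density
    return exp(-x * x / 2) / sqrt(2 * pi)

def Phi(x):  # standard normal cdf
    return 0.5 * (1 + erf(x / sqrt(2)))

c = 1.345
inside = 2 * Phi(c) - 1                    # E[ψ′] = P(|X| < c)
# E[ψ²] = E[X² 1{|X|<c}] + c² P(|X| ≥ c), with E[X² 1{|X|<c}] = inside − 2cφ(c)
e_psi2 = (inside - 2 * c * phi(c)) + c * c * (1 - inside)
avar = e_psi2 / inside ** 2
print(round(1 / avar, 3))                  # efficiency ≈ 0.95
```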


    M-estimates of location: adaptively weighted means

    Recall translation equivariance and symmetry imply ψ(x; t) = ψ(x − t) and ψ(−r) = −ψ(r).

    Express ψ(r) = r u(r) and let w_i = u(x_i − θ̂); then

    0 = Σ_{i=1}^n ψ(x_i − θ̂) = Σ_{i=1}^n (x_i − θ̂) u(x_i − θ̂) = Σ_{i=1}^n w_i (x_i − θ̂)

    ⇒ θ̂ = Σ_{i=1}^n w_i x_i / Σ_{i=1}^n w_i

    The weights are determined by the data cloud.

    [Figure: observations far from the bulk of the data are heavily downweighted]


    SOME COMMON M-ESTIMATES OF LOCATION

    Huber's M-estimate:

    ψ(r) =  −c   for r ≤ −c
             r   for |r| < c
             c   for r ≥ c

    Given a bound on the GES, it has maximum efficiency at the normal model.

    MLE for the least favorable distribution, i.e. the symmetric unimodal model with smallest Fisher information within a neighborhood of the normal.

    LFD = normal in the middle and double exponential in the tails.


    Tukey's bi-weight M-estimate (or bi-square):

    u(r) = [(1 − r²/c²)₊]²  where a₊ = max{0, a}

    Linear near zero.
    Smooth (continuous second derivatives).
    Strongly redescending to 0.
    NOT an MLE.
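A sketch of the bi-weight in code, showing the redescending property ψ(r) = r u(r) = 0 for |r| ≥ c (the default c = 4.685 is a common tuning constant, not fixed by the course):

```python
# Tukey's bi-weight: u(r) = [(1 − r²/c²)₊]², ψ(r) = r·u(r).
import numpy as np

def biweight_u(r, c=4.685):
    return np.maximum(0.0, 1.0 - (r / c) ** 2) ** 2

def biweight_psi(r, c=4.685):
    return r * biweight_u(r, c)

print(biweight_u(0.0))     # 1.0 — full weight at the center
print(biweight_psi(10.0))  # 0.0 — points beyond c get zero influence
```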


    CAUCHY MLE

    ψ(r) = (r/c) / (1 + r²/c²)

    NOT a strong redescender.


    COMPUTATIONS

    IRLS: Iteratively Re-weighted Least Squares algorithm.

    θ_{k+1} = Σ_{i=1}^n w_{i,k} x_i / Σ_{i=1}^n w_{i,k}

    where w_{i,k} = u(x_i − θ_k) and θ_0 is any initial value.

    Re-weighted mean = one-step M-estimate:

    θ_1 = Σ_{i=1}^n w_{i,0} x_i / Σ_{i=1}^n w_{i,0},

    where θ_0 is a preliminary robust estimate of location, such as the median.
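The iteration above can be sketched for Huber's u(r) = min(1, c/|r|). This is an illustration only: the constant c = 3 is an arbitrary choice (the course does not treat scale here), and the helper names are ours.

```python
# IRLS for an M-estimate of location: start from the median, iterate the
# weighted mean θ_{k+1} = Σ w_{i,k} x_i / Σ w_{i,k}.
import numpy as np

def huber_u(r, c):
    """Huber weight function u(r) = min(1, c/|r|)."""
    ar = np.abs(r)
    return np.where(ar <= c, 1.0, c / np.maximum(ar, 1e-12))

def irls_location(x, c=3.0, n_iter=50):
    theta = np.median(x)  # preliminary robust estimate
    for _ in range(n_iter):
        w = huber_u(x - theta, c)
        theta = np.sum(w * x) / np.sum(w)
    return theta

x = np.array([2.0, 4.0, 5.0, 10.0, 200.0])
print(irls_location(x))  # stays near the bulk, unlike the mean 44.2
```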


    PROOF OF CONVERGENCE

    Sketch.

    θ_{k+1} − θ_k = Σ_{i=1}^n w_{i,k} x_i / Σ_{i=1}^n w_{i,k} − θ_k = Σ_{i=1}^n w_{i,k}(x_i − θ_k) / Σ_{i=1}^n w_{i,k} = Σ_{i=1}^n ψ(x_i − θ_k) / Σ_{i=1}^n u(x_i − θ_k)

    Note: if θ_k > θ̂, then θ_k > θ_{k+1}, and if θ_k < θ̂, then θ_k < θ_{k+1}.

    Decreasing objective function:

    Let ρ(r) = ρ_o(r²) and suppose ρ_o′(s) ≥ 0 and ρ_o″(s) ≤ 0. Then

    Σ_{i=1}^n ρ(x_i − θ_{k+1}) < Σ_{i=1}^n ρ(x_i − θ_k).

    Examples:

    Mean: ρ_o(s) = s
    Median: ρ_o(s) = √s
    Cauchy MLE: ρ_o(s) = log(1 + s).

    These include redescending M-estimates of location.

    Generalization of the EM algorithm for mixtures of normals.

    Generalization of EM algorithm for mixture of normals.


    PROOF OF MONOTONE CONVERGENCE

    Sketch.

    Let r_{i,k} = (x_i − θ_k). By a Taylor series with remainder term,

    ρ_o(r²_{i,k+1}) = ρ_o(r²_{i,k}) + (r²_{i,k+1} − r²_{i,k}) ρ_o′(r²_{i,k}) + (1/2)(r²_{i,k+1} − r²_{i,k})² ρ_o″(r²_{i,*})

    So,

    Σ_{i=1}^n ρ_o(r²_{i,k+1})