Intro Bootstrap

Upload: michalaki-xrisoula

Post on 04-Apr-2018


  • 7/29/2019 Intro Bootstrap (Slide 1/18)

    Introduction to the Bootstrap

    Machelle D. Wilson

  • Slide 2/18

    Outline

    Why the Bootstrap?

    Limitations of traditional statistics

    How Does it Work?

    The Empirical Distribution Function and the Plug-in Principle

    Accuracy of an estimate: Bootstrap standard error and confidence intervals

    Examples

    How Good is the Bootstrap?

  • Slide 3/18

    Limitations of Traditional Statistics: Problems with Distributional Assumptions

    Often data cannot safely be assumed to be from an identifiable distribution.

    Sometimes the distribution of the statistic is mathematically intractable, even assuming that distributional assumptions can be made.

    Hence, the bootstrap often provides a superior alternative to parametric statistics.

  • Slide 4/18

    An Example Data Set

    [Figure: histogram of 1000 bootstrapped means (mean dose; mean conc. and dose rate fixed). Red lines = bootstrap CI; black lines = normal CI.]

  • Slide 5/18

    An Example Data Set

    [Figure: histogram of 1000 bootstrapped means (mean dose; mean conc. and dose rate random). Red lines = bootstrap CI; black lines = normal CI.]

  • Slide 6/18

    Statistics in the Computer Age

    Efron and Tibshirani, 1991, in Science: "Most of our familiar statistical methods, such as hypothesis testing, linear regression, analysis of variance, and maximum likelihood estimation, were designed to be implemented on mechanical calculators. Modern electronic computation has encouraged a host of new statistical methods that require fewer distributional assumptions than their predecessors and can be applied to more complicated statistical estimators without the usual concerns for mathematical tractability."

  • Slide 7/18

    The Bootstrap Solution

    With the advent of cheap, high-powered computing, it has become relatively easy to use resampling techniques, such as the bootstrap, to estimate the distribution of sample statistics empirically rather than making distributional assumptions.

    The bootstrap resamples the data with equal probability and with replacement and calculates the statistic of interest at each resampling. The resulting histogram, mean, quantiles, and variance of the bootstrapped statistics provide an estimate of its distribution.
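The procedure just described can be sketched in Python; this is a minimal illustration, with the data vector, seed, and B chosen arbitrarily:

```python
import random
import statistics

def bootstrap(data, statistic, B=1000, seed=0):
    """Resample `data` with replacement B times and return the list of
    bootstrapped statistics, one per resample."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample draws n points with equal probability, with replacement.
    return [statistic(rng.choices(data, k=n)) for _ in range(B)]

# Illustrative data; any numeric sample works.
x = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3]
boot_means = bootstrap(x, statistics.mean)
# The histogram, mean, quantiles, and variance of boot_means estimate
# the sampling distribution of the sample mean.
```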

  • Slide 8/18

    Example

    Take the data set {1, 2, 3}. There are 10 possible resamplings, where re-orderings are considered the same sampling:

    {1,2,3}  {1,1,2}  {1,1,3}  {2,2,1}  {2,2,3}
    {3,3,1}  {3,3,2}  {1,1,1}  {2,2,2}  {3,3,3}

    [Figure: histogram of the means of the resamples.]
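These ten resamplings can be enumerated directly: ignoring order, samples of size 3 taken with replacement are exactly what Python's `itertools.combinations_with_replacement` generates.

```python
from itertools import combinations_with_replacement

data = (1, 2, 3)
# All unordered resamples of size 3 drawn with replacement.
resamples = list(combinations_with_replacement(data, 3))
print(len(resamples))  # 10

# The means of these resamples are what the histogram on the slide shows.
means = sorted(sum(r) / 3 for r in resamples)
```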

  • Slide 9/18

    The Bootstrap Solution

    In general, the number of distinct bootstrap samples, Cn, is

        Cn = C(2n-1, n)   (the binomial coefficient "2n-1 choose n")

    Table of possible distinct bootstrap re-samplings by sample size:

    n    5    10      12        15        20         25         30
    Cn   126  92,378  1.35x10^6 7.76x10^7 6.89x10^10 6.32x10^13 5.91x10^16
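The count Cn = C(2n-1, n) can be verified with Python's `math.comb` (Python 3.8+); the loop reproduces the table:

```python
from math import comb

def n_bootstrap_samples(n):
    # Number of distinct resamples of size n drawn with replacement,
    # ignoring order: the binomial coefficient C(2n-1, n).
    return comb(2 * n - 1, n)

for n in (5, 10, 12, 15, 20, 25, 30):
    print(n, n_bootstrap_samples(n))
# n = 5 gives 126 and n = 10 gives 92,378, as in the table; n = 3
# gives the 10 resamplings of the {1, 2, 3} example.
```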

  • Slide 10/18

    The Empirical Distribution Function

    Having observed a random sample of size n, x1, x2, ..., xn, from a probability distribution F, the empirical distribution function (edf), F̂, assigns to a set A in the sample space of x its empirical probability

        F̂(A) = P̂(A) = #{xi in A} / n

  • Slide 11/18

    Example

    A random sample of 100 throws of a die yields 13 ones, 19 twos, 10 threes, 17 fours, 14 fives, and 27 sixes. Hence the edf is

        F̂(1) = 0.13    F̂(4) = 0.17
        F̂(2) = 0.19    F̂(5) = 0.14
        F̂(3) = 0.10    F̂(6) = 0.27
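A short sketch of this computation, with the counts taken from the slide:

```python
from collections import Counter

# 100 throws of a die, summarized as counts per face.
counts = Counter({1: 13, 2: 19, 3: 10, 4: 17, 5: 14, 6: 27})
n = sum(counts.values())  # 100

# The edf assigns each face its observed relative frequency.
edf = {face: counts[face] / n for face in range(1, 7)}
```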

  • Slide 12/18

    The Plug-in Principle

    It can be shown that F̂ is a sufficient statistic for F. That is, all the information about F contained in x is also contained in F̂.

    The plug-in principle estimates T(F) by T(F̂).

  • Slide 13/18

    The Plug-in Principle

    If the only information about F comes from the sample x, then T(F̂) is a minimum variance unbiased estimator of T(F).

    The bootstrap is drawing B samples from the empirical distribution to estimate B statistics of interest, T(x*).

    Hence, the bootstrap is both sampling from an edf (of the original sample) and generating an edf (of the statistic).
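For instance, if T(F) = P(X > c), the plug-in estimate T(F̂) is the sample fraction exceeding c; the data and threshold below are purely illustrative:

```python
data = [2.1, 3.4, 1.8, 4.7, 2.9, 3.3, 5.0, 2.2]
c = 3.0  # illustrative threshold

# Plug-in principle: replace F by the edf F-hat, so P(X > c) is
# estimated by the proportion of sample points exceeding c.
plug_in_estimate = sum(v > c for v in data) / len(data)
print(plug_in_estimate)  # 0.5
```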

  • Slide 14/18

    Graphical Representation of the Bootstrap

    x = {x1, x2, ..., xn}

    x*1  x*2  x*3  ...  x*B            (bootstrap resamples)

    T(x*1)  T(x*2)  T(x*3)  ...  T(x*B)    (bootstrapped statistics)

        se[T(x)] = { (1/(B-1)) * sum over b=1..B of [T(x*b) - t̄]^2 }^(1/2)

        where t̄ = (1/B) * sum over b=1..B of T(x*b)
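The schematic translates directly into code: resample, evaluate T on each resample, then apply the standard-deviation formula with t̄ the bootstrap mean. The data, seed, and B below are illustrative:

```python
import math
import random

def bootstrap_se(data, statistic, B=1000, seed=0):
    """Bootstrap standard error: the standard deviation (divisor B-1)
    of T(x*1), ..., T(x*B) around their mean t-bar."""
    rng = random.Random(seed)
    n = len(data)
    t = [statistic(rng.choices(data, k=n)) for _ in range(B)]
    t_bar = sum(t) / B
    return math.sqrt(sum((tb - t_bar) ** 2 for tb in t) / (B - 1))

x = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3]
se_mean = bootstrap_se(x, lambda s: sum(s) / len(s))
# For the sample mean this lands close to the textbook s / sqrt(n).
```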

  • Slide 15/18

    Bootstrap Standard Error and Confidence Intervals

    The bootstrap estimate of the mean is just the empirical average of the statistic over all bootstrap samples.

    The bootstrap estimate of standard error is just the empirical standard deviation of the bootstrap statistic over all bootstrap samples.

  • Slide 16/18

    Bootstrap Confidence Intervals

    The percentile interval: the bootstrap confidence interval for any statistic is simply the α/2 and 1-α/2 quantiles of its bootstrap distribution.

    For example, if B = 1000 and α = 0.05, then to construct the BS confidence interval we rank the statistics and take the 25th and the 975th values.

    There are other BS CIs, but this is the easiest and makes the fewest assumptions.
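A sketch of the percentile interval: sort the B bootstrapped statistics and read off the α/2 and 1-α/2 order statistics. The data here are illustrative:

```python
import random

def percentile_ci(data, statistic, alpha=0.05, B=1000, seed=0):
    """Percentile bootstrap CI: the alpha/2 and 1-alpha/2 empirical
    quantiles of the bootstrapped statistic."""
    rng = random.Random(seed)
    n = len(data)
    t = sorted(statistic(rng.choices(data, k=n)) for _ in range(B))
    lo = t[round(B * alpha / 2) - 1]        # the 25th sorted value when B = 1000
    hi = t[round(B * (1 - alpha / 2)) - 1]  # the 975th sorted value
    return lo, hi

x = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3]
lo, hi = percentile_ci(x, lambda s: sum(s) / len(s))
```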

  • Slide 17/18

    Example: Bootstrap of the Median

    Go to S-PLUS.
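The live demo is in S-PLUS; an equivalent median bootstrap can be sketched in Python (the sample below is illustrative):

```python
import random
import statistics

def bootstrap_medians(data, B=1000, seed=0):
    """Bootstrap distribution of the sample median, sorted."""
    rng = random.Random(seed)
    n = len(data)
    return sorted(statistics.median(rng.choices(data, k=n)) for _ in range(B))

# Illustrative skewed sample, where a normal-theory interval for the
# median would be hard to justify.
x = [1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 3.0, 3.8, 5.6, 9.9]
meds = bootstrap_medians(x)
ci = (meds[24], meds[974])  # 95% percentile interval with B = 1000
```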

  • Slide 18/18

    How Good is the Bootstrap?

    The bootstrap, in most cases, is as good as the empirical distribution function.

    The bootstrap is not optimal when there is good information about F that did not come from the data, i.e., prior information or strong, valid distributional assumptions.

    The bootstrap does not work well for extreme values and needs somewhat difficult modifications for autocorrelated data such as time series.

    When all our information comes from the sample itself, we cannot do better than the bootstrap.