Empirical Research Methods in Computer Science
Lecture 2, Part 1
October 19, 2005
Noah Smith
Some tips
- Perl scripts can be named encode instead of encode.pl.
- encode foo ≢ encode < foo
- chmod u+x encode
- Instead of making us run "java Encode", write a shell script:
    #!/bin/sh
    cd `dirname $0`
    java Encode
- Check that it works on (say) ugrad10.
Assignment 1
- If you didn't turn in a first version yesterday, don't bother; just turn in the final version.
- Final version due Tuesday 10/25, 8pm.
- We will post a few exercises soon.
- Questions?
Today
- Standard error
- Bootstrap for standard error
- Confidence intervals
- Hypothesis testing
Notation
- P is a population.
- S = [s1, s2, ..., sn] is a sample from P.
- Let X = [x1, x2, ..., xn] be some numerical measurement on the si, distributed over P according to an unknown distribution F.
- We may use Y, Z for other measurements.
Mean
- What does "mean" mean? μx is the population mean of x (depends on F).
- μx is in general unknown.
- How do we estimate the mean? The sample mean:

  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Gzip compression rate
- Usually < 1, but not always.
[Figure: histograms of gzip compression rates]
Accuracy
- How good an estimate is the sample mean?
- Standard error (se) of a statistic: we picked one S from P. How would x̄ vary if we picked a lot of samples from P? There is some "true" se value.
- Extreme cases: n → ∞ and n = 1.
Standard error (of the sample mean)
- Known: "standard error" = the standard deviation of a statistic.

  $se(\bar{x}) = \frac{\sigma_x}{\sqrt{n}}$

- σx is the true standard deviation of x under F.
[Figure: gzip compression rate]
Central Limit Theorem
- The sampling distribution of the sample mean approaches a normal distribution as n increases:

  $\bar{x} \sim N\!\left(\mu_x, \frac{\sigma_x^2}{n}\right)$
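An illustrative sketch of the CLT at work; the exponential distribution, sample sizes, and replication count here are assumptions for demonstration, not from the lecture.

```python
# Illustrative CLT check: means of skewed (exponential) draws become
# more tightly concentrated as n grows; the spread of the sample mean
# shrinks roughly like 1/sqrt(n).
import random
import statistics

random.seed(0)

def sample_mean(n):
    # mean of n draws from an exponential distribution (mean 1)
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# empirical spread of the sample mean for small vs. large n
spread_small = statistics.stdev(sample_mean(4) for _ in range(1000))
spread_large = statistics.stdev(sample_mean(100) for _ in range(1000))
```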
How to estimate σx
- "Plug-in principle":

  $\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

- Therefore:

  $\widehat{se}(\bar{x}) = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}{n}$
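The plug-in estimate of se(x̄) as a minimal Python sketch; the sample values are made up for illustration.

```python
# Plug-in estimate of the standard error of the sample mean.
import math

def se_hat(xs):
    n = len(xs)
    xbar = sum(xs) / n
    # plug-in estimate of sigma_x (divide by n, not n - 1)
    sigma_hat = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    return sigma_hat / math.sqrt(n)

se_x = se_hat([30, 28, 37, 34, 35])  # hypothetical measurements
```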
Plug-in principle
- We don't have (and can't get) P.
- We don't know F, the true distribution over X.
- We do have S (the sample).
- We do know $\hat{F}$, the sample distribution over X.
- Estimating a statistic: use $\hat{F}$ for F.
Good and Bad News
- Good: we have a formula to estimate the standard error of the sample mean.
- Bad: we have a formula for only the standard error of the sample mean, not for the variance, median, trimmed mean, ratio of means of x and y, or correlation between x and y.
Bootstrap world
- Real world: unknown distribution F → observed random sample X → statistic of interest $\hat{\theta} = s(X)$.
- Bootstrap world: empirical distribution $\hat{F}$ → bootstrap random sample X* → bootstrap replication $\hat{\theta}^* = s(X^*)$.
- The replications give statistics about the estimate (e.g., standard error).
Bootstrap sample
- X = [30, 28, 37, 34, 35]
- X* could be: [28, 34, 37, 34, 35], [35, 30, 34, 28, 37], [35, 35, 34, 30, 28], ...
- Draw n elements with replacement.
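Drawing one bootstrap sample, sketched in Python; `random.choices` samples with replacement.

```python
# One bootstrap sample: n draws from X with replacement.
import random

random.seed(1)
X = [30, 28, 37, 34, 35]
X_star = random.choices(X, k=len(X))  # same length, elements may repeat
```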
Reflection
- Imagine doing this with a pencil and paper.
- The bootstrap was born in 1979.
- Typically, sampling is costly and computation is cheap.
- In (empirical) CS, sampling isn't even necessarily all that costly.
Bootstrap estimate of se
- Let s(·) be a function for computing an estimate.
- True value of the standard error: $se_F(\hat{\theta})$
- Ideal bootstrap estimate: $se_{\hat{F}}(\hat{\theta}^*)$
- Bootstrap estimate with B bootstrap samples: $\widehat{se}_B$
Bootstrap estimate of se

  $\widehat{se}_B = \sqrt{\frac{1}{B-1} \sum_{i=1}^{B} \left( \hat{\theta}^*[i] - \bar{\hat{\theta}}^* \right)^2}$,
  where $\bar{\hat{\theta}}^* = \frac{1}{B} \sum_{i=1}^{B} \hat{\theta}^*[i]$

  $\lim_{B \to \infty} \widehat{se}_B = se_{\hat{F}}$
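A sketch of the B-sample bootstrap estimate; the data, B, and the choice of statistic (the mean) are illustrative, not from the lecture.

```python
# Bootstrap estimate of the standard error of a statistic s(.).
import math
import random

def bootstrap_se(xs, s, B=200, seed=0):
    rng = random.Random(seed)
    # B bootstrap replications of the statistic
    reps = [s(rng.choices(xs, k=len(xs))) for _ in range(B)]
    rbar = sum(reps) / B
    # standard deviation of the replications (dividing by B - 1)
    return math.sqrt(sum((r - rbar) ** 2 for r in reps) / (B - 1))

def mean(v):
    return sum(v) / len(v)

se_mean = bootstrap_se([30, 28, 37, 34, 35], mean)
```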
Bootstrap intuitively
- We don't know F.
- We would like lots of samples from P, but we only have one (S).
- We approximate F by $\hat{F}$ (plug-in principle).
- It is easy to generate lots of "samples" from $\hat{F}$.
[Figures: histograms of bootstrap replications of mean compression, for B = 25, 50, and 200]
Correlation (another statistic)
- Population P, sample S.
- Two values, xi and yi, for each element of the sample.
- Correlation coefficient ρ; sample correlation coefficient:

  $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$
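The sample correlation coefficient, translated directly into Python as a sketch.

```python
# Sample correlation coefficient r.
import math

def corr(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - xbar) ** 2 for x in xs))
           * math.sqrt(sum((y - ybar) ** 2 for y in ys)))
    return num / den
```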
Example: gzip compression
- r = 0.9616
Accuracy of r
- No general closed form for se(r).
- If we assume x and y are bivariate Gaussian:

  $se_{normal}(r) = \frac{1 - r^2}{\sqrt{n - 3}}$
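The normal-theory formula in code; the example arguments are hypothetical.

```python
# Normal-theory standard error of r, valid under a bivariate
# Gaussian assumption on (x, y).
import math

def se_normal(r, n):
    return (1 - r ** 2) / math.sqrt(n - 3)
```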
[Figure: $se_{normal}(r)$ plotted as a function of r and n]
Normality
- Why assume the data are Gaussian?
- Alternative: the bootstrap estimate of the standard error of r:

  $\widehat{se}_B(r) = \sqrt{\frac{1}{B-1} \sum_{i=1}^{B} \left( r^*[i] - \bar{r}^* \right)^2}$
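The bootstrap alternative for se(r), sketched with toy paired data (an assumption for illustration); note that resampling keeps each (xi, yi) pair together.

```python
# Bootstrap se of the sample correlation: resample pairs with
# replacement and take the spread of the replicated correlations.
import math
import random

def corr(xs, ys):
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    num = sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - xb) ** 2 for x in xs)
                    * sum((y - yb) ** 2 for y in ys))
    return num / den

def bootstrap_se_r(pairs, B=200, seed=0):
    rng = random.Random(seed)
    reps = []
    for _ in range(B):
        resampled = rng.choices(pairs, k=len(pairs))  # pairs stay together
        xs, ys = zip(*resampled)
        reps.append(corr(xs, ys))
    rbar = sum(reps) / B
    return math.sqrt(sum((r - rbar) ** 2 for r in reps) / (B - 1))

# toy, strongly correlated data
pairs = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8),
         (5, 5.3), (6, 6.1), (7, 6.8), (8, 8.2)]
se_r = bootstrap_se_r(pairs)
```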
Example: gzip compression
- r = 0.9616
- $se_{normal}(r)$ = 0.0024
- $se_{200}(r)$ = 0.0298
se bootstrap advice
- Plot the data.
- Consider runtime.
- Efron and Tibshirani: B = 25 is informative; B = 50 is often enough; you seldom need B > 200 (for se).
Summary so far
- A statistic is a "true fact" about the distribution F.
- We don't know F.
- For some parameter θ, we want an estimate $\hat{\theta}$ ("theta hat") and the accuracy of that estimate (e.g., its standard error).
- For the mean μ, we have a closed form.
- For other θ, the bootstrap will help.
Some tips Perl scripts can be named encode instead
of encodepl encode foo ≢ encode lt foo chmod u+x encode Instead of making us run java Encode
write a shell script binsh cd `dirname $0` java Encode
Check that it works on (say) ugrad10
Assignment 1
If you didnrsquot turn in a first version yesterday donrsquot bother ndash just turn in the final version
Final version due Tuesday 1025 8pm
We will post a few exercises soon Questions
Today
Standard error Bootstrap for standard error Confidence intervals Hypothesis testing
Notation
P is a population S = [s1 s2 sn] is a sample from P
Let X = [x1 x2 xn] be some numerical measurement on the si distributed over P according to unknown F
We may use Y Z for other measurements
Mean
What does mean mean μx is population mean of x
(depends on F)
μx is in general unknown
How do we estimate the mean Sample mean
n
xx
n
1ii
Gzip compression rate
usually lt 1 but not always
Gzip compression rate
Accuracy
How good an estimate is the sample mean
Standard error (se) of a statistic We picked one S from P How would vary if we picked a lot of
samples from P There is some ldquotruerdquo se value
x
Extreme cases
n rarr infin
n = 1
Standard error (of the sample mean)
Known
ldquoStandard errorrdquo = standard deviation of a statistic
n)x(se x
true standard deviation of x under F
Gzip compression rate
Central Limit Theorem
The sampling distribution of the sample mean approaches a normal distribution as n increases
nμx
2xN
How to estimate σx
ldquoPlug-in principlerdquo
Therefore
n
1i
2i xx
n1
ˆ
n
1i
2
i
nxx
nˆ
xse
Plug-in principle
We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution
over X We do have S (the sample)
We do know the sample distribution over X
Estimating a statistic use for F
F
F
Good and Bad News
We have a formula to estimate the standard error of the sample mean
We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y
Bootstrap world
unknown distribution F
observed random sample X
statistic of interest )X(sˆ
empirical distribution
bootstrap random sample X
bootstrap replication )X(sˆ
F
statistics about the estimate (eg standard error)
Bootstrap sample
X = [30 28 37 34 35] X could be
[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]
Draw n elements with replacement
Reflection
Imagine doing this with a pencil and paper
The bootstrap was born in 1979 Typically sampling is costly and
computation is cheap In (empirical) CS sampling isnrsquot even
necessarily all that costly
Bootstrap estimate of se
Let s() be a function for computing an estimate
True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap
samples
seF
FF
seˆse
BB seˆse
Bootstrap estimate of se
B
1i
2
B1B
ˆ]i[ˆˆse
FBB
seselim
Bootstrap intuitively
We donrsquot know F We would like lots of samples from P
but we only have one (S) We approximate F by
Plug-in principle Easy to generate lots of ldquosamplesrdquo
from
F
F
B = 25 (mean compression)
B = 50 (mean compression)
B = 200 (mean compression)
Correlation (another statistic)
Population P sample S Two values xi and yi for each element
of the sample Correlation coefficient ρ Sample correlation coefficient
n
1i
2i
n
1i
2i
n
1iii
yyxx
yyxxr
Example gzip compression
r = 09616
Accuracy of r
No general closed form for se(r) If we assume x and y are bivariate
Gaussian
3n
r1)r(se
2
normal
-1-05
005
110
2030
4050
6070
8090
100
-05
0
05
1
senormal
rn
senormal
Normality
Why assume the data are Gaussian
Alternative bootstrap estimate of the standard error of r
B
1i
2
B1B
r]i[rrse
Example gzip compression
r = 09616
senormal(r) = 00024
se200(r) = 00298
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help
Assignment 1
If you didnrsquot turn in a first version yesterday donrsquot bother ndash just turn in the final version
Final version due Tuesday 1025 8pm
We will post a few exercises soon Questions
Today
Standard error Bootstrap for standard error Confidence intervals Hypothesis testing
Notation
P is a population S = [s1 s2 sn] is a sample from P
Let X = [x1 x2 xn] be some numerical measurement on the si distributed over P according to unknown F
We may use Y Z for other measurements
Mean
What does mean mean μx is population mean of x
(depends on F)
μx is in general unknown
How do we estimate the mean Sample mean
n
xx
n
1ii
Gzip compression rate
usually lt 1 but not always
Gzip compression rate
Accuracy
How good an estimate is the sample mean
Standard error (se) of a statistic We picked one S from P How would vary if we picked a lot of
samples from P There is some ldquotruerdquo se value
x
Extreme cases
n rarr infin
n = 1
Standard error (of the sample mean)
Known
ldquoStandard errorrdquo = standard deviation of a statistic
n)x(se x
true standard deviation of x under F
Gzip compression rate
Central Limit Theorem
The sampling distribution of the sample mean approaches a normal distribution as n increases
nμx
2xN
How to estimate σx
ldquoPlug-in principlerdquo
Therefore
n
1i
2i xx
n1
ˆ
n
1i
2
i
nxx
nˆ
xse
Plug-in principle
We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution
over X We do have S (the sample)
We do know the sample distribution over X
Estimating a statistic use for F
F
F
Good and Bad News
We have a formula to estimate the standard error of the sample mean
We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y
Bootstrap world
unknown distribution F
observed random sample X
statistic of interest )X(sˆ
empirical distribution
bootstrap random sample X
bootstrap replication )X(sˆ
F
statistics about the estimate (eg standard error)
Bootstrap sample
X = [30 28 37 34 35] X could be
[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]
Draw n elements with replacement
Reflection
Imagine doing this with a pencil and paper
The bootstrap was born in 1979 Typically sampling is costly and
computation is cheap In (empirical) CS sampling isnrsquot even
necessarily all that costly
Bootstrap estimate of se
Let s() be a function for computing an estimate
True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap
samples
seF
FF
seˆse
BB seˆse
Bootstrap estimate of se
B
1i
2
B1B
ˆ]i[ˆˆse
FBB
seselim
Bootstrap intuitively
We donrsquot know F We would like lots of samples from P
but we only have one (S) We approximate F by
Plug-in principle Easy to generate lots of ldquosamplesrdquo
from
F
F
B = 25 (mean compression)
B = 50 (mean compression)
B = 200 (mean compression)
Correlation (another statistic)
Population P sample S Two values xi and yi for each element
of the sample Correlation coefficient ρ Sample correlation coefficient
n
1i
2i
n
1i
2i
n
1iii
yyxx
yyxxr
Example gzip compression
r = 09616
Accuracy of r
No general closed form for se(r) If we assume x and y are bivariate
Gaussian
3n
r1)r(se
2
normal
-1-05
005
110
2030
4050
6070
8090
100
-05
0
05
1
senormal
rn
senormal
Normality
Why assume the data are Gaussian
Alternative bootstrap estimate of the standard error of r
B
1i
2
B1B
r]i[rrse
Example gzip compression
r = 09616
senormal(r) = 00024
se200(r) = 00298
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help
Today
Standard error Bootstrap for standard error Confidence intervals Hypothesis testing
Notation
P is a population S = [s1 s2 sn] is a sample from P
Let X = [x1 x2 xn] be some numerical measurement on the si distributed over P according to unknown F
We may use Y Z for other measurements
Mean
What does mean mean μx is population mean of x
(depends on F)
μx is in general unknown
How do we estimate the mean Sample mean
n
xx
n
1ii
Gzip compression rate
usually lt 1 but not always
Gzip compression rate
Accuracy
How good an estimate is the sample mean
Standard error (se) of a statistic We picked one S from P How would vary if we picked a lot of
samples from P There is some ldquotruerdquo se value
x
Extreme cases
n rarr infin
n = 1
Standard error (of the sample mean)
Known
ldquoStandard errorrdquo = standard deviation of a statistic
n)x(se x
true standard deviation of x under F
Gzip compression rate
Central Limit Theorem
The sampling distribution of the sample mean approaches a normal distribution as n increases
nμx
2xN
How to estimate σx
ldquoPlug-in principlerdquo
Therefore
n
1i
2i xx
n1
ˆ
n
1i
2
i
nxx
nˆ
xse
Plug-in principle
We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution
over X We do have S (the sample)
We do know the sample distribution over X
Estimating a statistic use for F
F
F
Good and Bad News
We have a formula to estimate the standard error of the sample mean
We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y
Bootstrap world
unknown distribution F
observed random sample X
statistic of interest )X(sˆ
empirical distribution
bootstrap random sample X
bootstrap replication )X(sˆ
F
statistics about the estimate (eg standard error)
Bootstrap sample
X = [30 28 37 34 35] X could be
[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]
Draw n elements with replacement
Reflection
Imagine doing this with a pencil and paper
The bootstrap was born in 1979 Typically sampling is costly and
computation is cheap In (empirical) CS sampling isnrsquot even
necessarily all that costly
Bootstrap estimate of se
Let s() be a function for computing an estimate
True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap
samples
seF
FF
seˆse
BB seˆse
Bootstrap estimate of se
B
1i
2
B1B
ˆ]i[ˆˆse
FBB
seselim
Bootstrap intuitively
We donrsquot know F We would like lots of samples from P
but we only have one (S) We approximate F by
Plug-in principle Easy to generate lots of ldquosamplesrdquo
from
F
F
B = 25 (mean compression)
B = 50 (mean compression)
B = 200 (mean compression)
Correlation (another statistic)
Population P sample S Two values xi and yi for each element
of the sample Correlation coefficient ρ Sample correlation coefficient
n
1i
2i
n
1i
2i
n
1iii
yyxx
yyxxr
Example gzip compression
r = 09616
Accuracy of r
No general closed form for se(r) If we assume x and y are bivariate
Gaussian
3n
r1)r(se
2
normal
-1-05
005
110
2030
4050
6070
8090
100
-05
0
05
1
senormal
rn
senormal
Normality
Why assume the data are Gaussian
Alternative bootstrap estimate of the standard error of r
B
1i
2
B1B
r]i[rrse
Example gzip compression
r = 09616
senormal(r) = 00024
se200(r) = 00298
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help
Notation
P is a population S = [s1 s2 sn] is a sample from P
Let X = [x1 x2 xn] be some numerical measurement on the si distributed over P according to unknown F
We may use Y Z for other measurements
Mean
What does mean mean μx is population mean of x
(depends on F)
μx is in general unknown
How do we estimate the mean Sample mean
n
xx
n
1ii
Gzip compression rate
usually lt 1 but not always
Gzip compression rate
Accuracy
How good an estimate is the sample mean
Standard error (se) of a statistic We picked one S from P How would vary if we picked a lot of
samples from P There is some ldquotruerdquo se value
x
Extreme cases
n rarr infin
n = 1
Standard error (of the sample mean)
Known
ldquoStandard errorrdquo = standard deviation of a statistic
n)x(se x
true standard deviation of x under F
Gzip compression rate
Central Limit Theorem
The sampling distribution of the sample mean approaches a normal distribution as n increases
nμx
2xN
How to estimate σx
ldquoPlug-in principlerdquo
Therefore
n
1i
2i xx
n1
ˆ
n
1i
2
i
nxx
nˆ
xse
Plug-in principle
We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution
over X We do have S (the sample)
We do know the sample distribution over X
Estimating a statistic use for F
F
F
Good and Bad News
We have a formula to estimate the standard error of the sample mean
We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y
Bootstrap world
unknown distribution F
observed random sample X
statistic of interest )X(sˆ
empirical distribution
bootstrap random sample X
bootstrap replication )X(sˆ
F
statistics about the estimate (eg standard error)
Bootstrap sample
X = [30 28 37 34 35] X could be
[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]
Draw n elements with replacement
Reflection
Imagine doing this with a pencil and paper
The bootstrap was born in 1979 Typically sampling is costly and
computation is cheap In (empirical) CS sampling isnrsquot even
necessarily all that costly
Bootstrap estimate of se
Let s() be a function for computing an estimate
True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap
samples
seF
FF
seˆse
BB seˆse
Bootstrap estimate of se
B
1i
2
B1B
ˆ]i[ˆˆse
FBB
seselim
Bootstrap intuitively
We donrsquot know F We would like lots of samples from P
but we only have one (S) We approximate F by
Plug-in principle Easy to generate lots of ldquosamplesrdquo
from
F
F
B = 25 (mean compression)
B = 50 (mean compression)
B = 200 (mean compression)
Correlation (another statistic)
Population P sample S Two values xi and yi for each element
of the sample Correlation coefficient ρ Sample correlation coefficient
n
1i
2i
n
1i
2i
n
1iii
yyxx
yyxxr
Example gzip compression
r = 09616
Accuracy of r
No general closed form for se(r) If we assume x and y are bivariate
Gaussian
3n
r1)r(se
2
normal
-1-05
005
110
2030
4050
6070
8090
100
-05
0
05
1
senormal
rn
senormal
Normality
Why assume the data are Gaussian
Alternative bootstrap estimate of the standard error of r
B
1i
2
B1B
r]i[rrse
Example gzip compression
r = 09616
senormal(r) = 00024
se200(r) = 00298
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help
Mean
What does mean mean μx is population mean of x
(depends on F)
μx is in general unknown
How do we estimate the mean Sample mean
n
xx
n
1ii
Gzip compression rate
usually lt 1 but not always
Gzip compression rate
Accuracy
How good an estimate is the sample mean
Standard error (se) of a statistic We picked one S from P How would vary if we picked a lot of
samples from P There is some ldquotruerdquo se value
x
Extreme cases
n rarr infin
n = 1
Standard error (of the sample mean)
Known
ldquoStandard errorrdquo = standard deviation of a statistic
n)x(se x
true standard deviation of x under F
Gzip compression rate
Central Limit Theorem
The sampling distribution of the sample mean approaches a normal distribution as n increases
nμx
2xN
How to estimate σx
ldquoPlug-in principlerdquo
Therefore
n
1i
2i xx
n1
ˆ
n
1i
2
i
nxx
nˆ
xse
Plug-in principle
We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution
over X We do have S (the sample)
We do know the sample distribution over X
Estimating a statistic use for F
F
F
Good and Bad News
We have a formula to estimate the standard error of the sample mean
We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y
Bootstrap world
unknown distribution F
observed random sample X
statistic of interest )X(sˆ
empirical distribution
bootstrap random sample X
bootstrap replication )X(sˆ
F
statistics about the estimate (eg standard error)
Bootstrap sample
X = [30 28 37 34 35] X could be
[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]
Draw n elements with replacement
Reflection
Imagine doing this with a pencil and paper
The bootstrap was born in 1979 Typically sampling is costly and
computation is cheap In (empirical) CS sampling isnrsquot even
necessarily all that costly
Bootstrap estimate of se
Let s() be a function for computing an estimate
True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap
samples
seF
FF
seˆse
BB seˆse
Bootstrap estimate of se
B
1i
2
B1B
ˆ]i[ˆˆse
FBB
seselim
Bootstrap intuitively
We donrsquot know F We would like lots of samples from P
but we only have one (S) We approximate F by
Plug-in principle Easy to generate lots of ldquosamplesrdquo
from
F
F
B = 25 (mean compression)
B = 50 (mean compression)
B = 200 (mean compression)
Correlation (another statistic)
Population P sample S Two values xi and yi for each element
of the sample Correlation coefficient ρ Sample correlation coefficient
n
1i
2i
n
1i
2i
n
1iii
yyxx
yyxxr
Example gzip compression
r = 09616
Accuracy of r
No general closed form for se(r) If we assume x and y are bivariate
Gaussian
3n
r1)r(se
2
normal
-1-05
005
110
2030
4050
6070
8090
100
-05
0
05
1
senormal
rn
senormal
Normality
Why assume the data are Gaussian
Alternative bootstrap estimate of the standard error of r
B
1i
2
B1B
r]i[rrse
Example gzip compression
r = 09616
senormal(r) = 00024
se200(r) = 00298
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help
Gzip compression rate
usually lt 1 but not always
Gzip compression rate
Accuracy
How good an estimate is the sample mean
Standard error (se) of a statistic We picked one S from P How would vary if we picked a lot of
samples from P There is some ldquotruerdquo se value
x
Extreme cases
n rarr infin
n = 1
Standard error (of the sample mean)
Known
ldquoStandard errorrdquo = standard deviation of a statistic
n)x(se x
true standard deviation of x under F
Gzip compression rate
Central Limit Theorem
The sampling distribution of the sample mean approaches a normal distribution as n increases
nμx
2xN
How to estimate σx
ldquoPlug-in principlerdquo
Therefore
n
1i
2i xx
n1
ˆ
n
1i
2
i
nxx
nˆ
xse
Plug-in principle
We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution
over X We do have S (the sample)
We do know the sample distribution over X
Estimating a statistic use for F
F
F
Good and Bad News
We have a formula to estimate the standard error of the sample mean
We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y
Bootstrap world
unknown distribution F
observed random sample X
statistic of interest )X(sˆ
empirical distribution
bootstrap random sample X
bootstrap replication )X(sˆ
F
statistics about the estimate (eg standard error)
Bootstrap sample
X = [30 28 37 34 35] X could be
[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]
Draw n elements with replacement
Reflection
Imagine doing this with a pencil and paper
The bootstrap was born in 1979 Typically sampling is costly and
computation is cheap In (empirical) CS sampling isnrsquot even
necessarily all that costly
Bootstrap estimate of se
Let s() be a function for computing an estimate
True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap
samples
seF
FF
seˆse
BB seˆse
Bootstrap estimate of se
B
1i
2
B1B
ˆ]i[ˆˆse
FBB
seselim
Bootstrap intuitively
We donrsquot know F We would like lots of samples from P
but we only have one (S) We approximate F by
Plug-in principle Easy to generate lots of ldquosamplesrdquo
from
F
F
B = 25 (mean compression)
B = 50 (mean compression)
B = 200 (mean compression)
Correlation (another statistic)
Population P sample S Two values xi and yi for each element
of the sample Correlation coefficient ρ Sample correlation coefficient
n
1i
2i
n
1i
2i
n
1iii
yyxx
yyxxr
Example gzip compression
r = 09616
Accuracy of r
No general closed form for se(r) If we assume x and y are bivariate
Gaussian
3n
r1)r(se
2
normal
-1-05
005
110
2030
4050
6070
8090
100
-05
0
05
1
senormal
rn
senormal
Normality
Why assume the data are Gaussian
Alternative bootstrap estimate of the standard error of r
B
1i
2
B1B
r]i[rrse
Example gzip compression
r = 09616
senormal(r) = 00024
se200(r) = 00298
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help
Gzip compression rate
Accuracy
How good an estimate is the sample mean
Standard error (se) of a statistic We picked one S from P How would vary if we picked a lot of
samples from P There is some ldquotruerdquo se value
x
Extreme cases
n rarr infin
n = 1
Standard error (of the sample mean)
Known
ldquoStandard errorrdquo = standard deviation of a statistic
n)x(se x
true standard deviation of x under F
Gzip compression rate
Central Limit Theorem
The sampling distribution of the sample mean approaches a normal distribution as n increases
nμx
2xN
How to estimate σx
ldquoPlug-in principlerdquo
Therefore
n
1i
2i xx
n1
ˆ
n
1i
2
i
nxx
nˆ
xse
Plug-in principle
We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution
over X We do have S (the sample)
We do know the sample distribution over X
Estimating a statistic use for F
F
F
Good and Bad News
We have a formula to estimate the standard error of the sample mean
We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y
Bootstrap world
unknown distribution F
observed random sample X
statistic of interest )X(sˆ
empirical distribution
bootstrap random sample X
bootstrap replication )X(sˆ
F
statistics about the estimate (eg standard error)
Bootstrap sample
X = [30 28 37 34 35] X could be
[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]
Draw n elements with replacement
Reflection
Imagine doing this with a pencil and paper
The bootstrap was born in 1979 Typically sampling is costly and
computation is cheap In (empirical) CS sampling isnrsquot even
necessarily all that costly
Bootstrap estimate of se
Let s() be a function for computing an estimate
True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap
samples
seF
FF
seˆse
BB seˆse
Bootstrap estimate of se
B
1i
2
B1B
ˆ]i[ˆˆse
FBB
seselim
Bootstrap intuitively
We donrsquot know F We would like lots of samples from P
but we only have one (S) We approximate F by
Plug-in principle Easy to generate lots of ldquosamplesrdquo
from
F
F
B = 25 (mean compression)
B = 50 (mean compression)
B = 200 (mean compression)
Correlation (another statistic)
Population P sample S Two values xi and yi for each element
of the sample Correlation coefficient ρ Sample correlation coefficient
n
1i
2i
n
1i
2i
n
1iii
yyxx
yyxxr
Example gzip compression
r = 09616
Accuracy of r
No general closed form for se(r) If we assume x and y are bivariate
Gaussian
3n
r1)r(se
2
normal
-1-05
005
110
2030
4050
6070
8090
100
-05
0
05
1
senormal
rn
senormal
Normality
Why assume the data are Gaussian
Alternative bootstrap estimate of the standard error of r
B
1i
2
B1B
r]i[rrse
Example gzip compression
r = 09616
senormal(r) = 00024
se200(r) = 00298
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help
Accuracy
How good an estimate is the sample mean
Standard error (se) of a statistic We picked one S from P How would vary if we picked a lot of
samples from P There is some ldquotruerdquo se value
x
Extreme cases
n rarr infin
n = 1
Standard error (of the sample mean)
Known
ldquoStandard errorrdquo = standard deviation of a statistic
n)x(se x
true standard deviation of x under F
Gzip compression rate
Central Limit Theorem
The sampling distribution of the sample mean approaches a normal distribution as n increases
nμx
2xN
How to estimate σx
ldquoPlug-in principlerdquo
Therefore
n
1i
2i xx
n1
ˆ
n
1i
2
i
nxx
nˆ
xse
Plug-in principle
We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution
over X We do have S (the sample)
We do know the sample distribution over X
Estimating a statistic use for F
F
F
Good and Bad News
We have a formula to estimate the standard error of the sample mean
We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y
Bootstrap world
unknown distribution F
observed random sample X
statistic of interest )X(sˆ
empirical distribution
bootstrap random sample X
bootstrap replication )X(sˆ
F
statistics about the estimate (eg standard error)
Bootstrap sample
X = [30 28 37 34 35] X could be
[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]
Draw n elements with replacement
Reflection
Imagine doing this with a pencil and paper
The bootstrap was born in 1979 Typically sampling is costly and
computation is cheap In (empirical) CS sampling isnrsquot even
necessarily all that costly
Bootstrap estimate of se
Let s() be a function for computing an estimate
True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap
samples
seF
FF
seˆse
BB seˆse
Bootstrap estimate of se
B
1i
2
B1B
ˆ]i[ˆˆse
FBB
seselim
Bootstrap intuitively
We donrsquot know F We would like lots of samples from P
but we only have one (S) We approximate F by
Plug-in principle Easy to generate lots of ldquosamplesrdquo
from
F
F
B = 25 (mean compression)
B = 50 (mean compression)
B = 200 (mean compression)
Correlation (another statistic)
Population P sample S Two values xi and yi for each element
of the sample Correlation coefficient ρ Sample correlation coefficient
n
1i
2i
n
1i
2i
n
1iii
yyxx
yyxxr
Example gzip compression
r = 09616
Accuracy of r
No general closed form for se(r) If we assume x and y are bivariate
Gaussian
3n
r1)r(se
2
normal
-1-05
005
110
2030
4050
6070
8090
100
-05
0
05
1
senormal
rn
senormal
Normality
Why assume the data are Gaussian?
Alternative: a bootstrap estimate of the standard error of r:
ŝe_B(r) = √( (1/(B−1)) Σᵢ₌₁^B (r*[i] − r*(·))² )
Example: gzip compression
r = 0.9616
se_normal(r) = 0.0024
se_200(r) = 0.0298
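The two estimates can be compared in a quick sketch (my illustration on hypothetical noisy linear data; this does not reproduce the lecture's gzip numbers). Note that the bootstrap for r must resample (x, y) pairs together, not x and y independently:

```python
import math
import random

def corr(xs, ys):
    """Sample correlation coefficient r."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - xbar) ** 2 for x in xs)
                    * sum((y - ybar) ** 2 for y in ys))
    return num / den

def se_normal(r, n):
    """Normal-theory estimate: (1 - r^2) / sqrt(n - 3)."""
    return (1 - r ** 2) / math.sqrt(n - 3)

def bootstrap_se_r(pairs, B=200, seed=0):
    """Bootstrap se of r: resample (x, y) PAIRS with replacement
    (keeping each x with its y), recompute r on each resample,
    and take the sd of the B replications."""
    rng = random.Random(seed)
    reps = []
    for _ in range(B):
        resample = [rng.choice(pairs) for _ in pairs]
        xs, ys = zip(*resample)
        reps.append(corr(xs, ys))
    m = sum(reps) / B
    return math.sqrt(sum((r - m) ** 2 for r in reps) / (B - 1))

# Hypothetical noisy linear data, just to exercise both estimates:
pairs = [(i, 2 * i + (i % 3)) for i in range(20)]
r = corr(*zip(*pairs))
print(se_normal(r, len(pairs)), bootstrap_se_r(pairs))
```

As in the gzip example, the two estimates need not agree; the bootstrap makes no Gaussian assumption.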
se bootstrap advice
Plot the data. Consider runtime. Efron and Tibshirani: B = 25 is informative; B = 50 is often enough; you seldom need B > 200 (for se).
Summary so far
A statistic is a "true fact" about the distribution F. We don't know F.
For some parameter θ, we want an estimate θ̂ and the accuracy of that estimate (e.g., its standard error).
For the mean μ, we have a closed form; for other θ, the bootstrap will help.
Example gzip compression
r = 09616
senormal(r) = 00024
se200(r) = 00298
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help
Accuracy of r
No general closed form for se(r) If we assume x and y are bivariate
Gaussian
3n
r1)r(se
2
normal
-1-05
005
110
2030
4050
6070
8090
100
-05
0
05
1
senormal
rn
senormal
Normality
Why assume the data are Gaussian
Alternative bootstrap estimate of the standard error of r
B
1i
2
B1B
r]i[rrse
Example gzip compression
r = 09616
senormal(r) = 00024
se200(r) = 00298
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help
-1-05
005
110
2030
4050
6070
8090
100
-05
0
05
1
senormal
rn
senormal
Normality
Why assume the data are Gaussian
Alternative bootstrap estimate of the standard error of r
B
1i
2
B1B
r]i[rrse
Example gzip compression
r = 09616
senormal(r) = 00024
se200(r) = 00298
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help
Normality
Why assume the data are Gaussian
Alternative bootstrap estimate of the standard error of r
B
1i
2
B1B
r]i[rrse
Example gzip compression
r = 09616
senormal(r) = 00024
se200(r) = 00298
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help
Example gzip compression
r = 09616
senormal(r) = 00024
se200(r) = 00298
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help
se200(r) = 00298
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help
se bootstrap advice
Plot the data Runtime Efron and Tibshirani
B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)
Summary so far
A statistic is a ldquotrue factrdquo about the distribution F
We donrsquot know F For some parameter θ we want
estimate ldquoθ hatrdquo accuracy of that estimate (eg standard
error) For the mean μ we have a closed
form For other θ the bootstrap will help