Empirical Research Methods in Computer Science, Lecture 4 (November 2, 2005), Noah Smith


Empirical Research Methods in Computer Science

Lecture 4
November 2, 2005
Noah Smith

Today

Review bootstrap estimate of se (from homework).

Review sign and permutation tests for paired samples.

Lots of examples of hypothesis tests.

Recall ...

There is a true value of the statistic. But we don’t know it.

We can compute the sample statistic.

We know sample means are normally distributed (as n gets big):

x̄ ~ N(μx, σx²/n), so se(x̄) = σx / √n

But we don’t know anything about the distribution of other sample statistics (medians, correlations, etc.)!

Bootstrap world

unknown distribution F

observed random sample X

statistic of interest θ̂ = s(X)

empirical distribution

bootstrap random sample X*

bootstrap replication θ̂* = s(X*)

statistics about the estimate (e.g., standard error)

Bootstrap estimate of se

Run B bootstrap replicates, and compute the statistic each time: θ*[1], θ*[2], θ*[3], ..., θ*[B]

mean(θ*) = (1/B) Σ_{i=1..B} θ*[i]   (mean of θ* across replications)

se(θ̂) ≈ sqrt( (1/(B−1)) Σ_{i=1..B} (θ*[i] − mean(θ*))² )   (sample standard deviation of θ* across replications)
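The estimator above takes only a few lines of code. This is an illustrative Python sketch (not from the lecture); it uses the sample median as the statistic of interest, and the name `bootstrap_se` is mine.

```python
import random
import statistics

def bootstrap_se(sample, stat=statistics.median, B=2000, seed=0):
    """Bootstrap estimate of the standard error of stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    # Draw B bootstrap replicates: resample n points with replacement
    # from the observed sample, recomputing the statistic each time.
    thetas = [stat([rng.choice(sample) for _ in range(n)]) for _ in range(B)]
    mean = sum(thetas) / B
    # Sample standard deviation of the replicates (B - 1 in the denominator).
    return (sum((t - mean) ** 2 for t in thetas) / (B - 1)) ** 0.5

se = bootstrap_se([1.2, 3.4, 2.2, 5.1, 2.9, 4.0, 3.3, 2.5])
```

The same function works for any statistic (correlation, trimmed mean, ...) by swapping the `stat` argument, which is exactly the point of the bootstrap: no closed-form se formula is needed.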

Paired-Sample Design

pairs (xi, yi)
x ~ distribution F
y ~ distribution G
How do F and G differ?

Sign Test

H0: F and G have the same median:
median(F) − median(G) = 0

Under H0, Pr(x > y) = 0.5, so sign(x − y) ~ binomial distribution; compute bin(N+, 0.5)

p = Σ_{n=N+..N} bin(n; N, 0.5)

Sign Test

nonparametric (no assumptions about the data)

closed form (no random sampling)

Example: gzip speed

build gzip with -O2 or with -O0

on about 650 files out of 1000, gzip -O2 was faster

binomial distribution, p = 0.5, n = 1000: p < 3 × 10^-24
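The binomial tail above can be computed exactly with big integers. A sketch, assuming exactly 650 of the 1000 files favored gzip -O2 (the slide says "about 650"; with exactly 650 the tail is on the order of 10^-21, so the slide's smaller figure suggests the actual count was a bit higher):

```python
from math import comb

def sign_test_p(n_plus, n):
    """One-sided sign test p-value: P(X >= n_plus) for X ~ Binomial(n, 0.5),
    computed exactly as (sum of binomial coefficients) / 2^n."""
    return sum(comb(n, k) for k in range(n_plus, n + 1)) / 2 ** n

p = sign_test_p(650, 1000)  # gzip -O2 faster on 650 of 1000 files
```

No distributional assumption about the runtimes themselves is used, only the signs of the paired differences.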

Permutation Test

H0: F = G

Suppose the difference in sample means is d. How likely is this difference (or a greater one) under H0?

For i = 1 to P:
  randomly permute each (xi, yi)
  compute difference in sample means
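The loop above can be sketched directly in Python; this is a minimal illustration (not the lecture's code), with the function name and toy data mine. "Permuting a pair" means randomly swapping x and y within that pair:

```python
import random

def perm_test(xs, ys, P=1000, seed=0):
    """Paired permutation test for H0: F = G.
    Randomly swaps each pair (x_i, y_i) and recomputes the difference
    in sample means; returns the fraction of permuted differences at
    least as extreme as the observed one (a two-sided p-value)."""
    rng = random.Random(seed)
    n = len(xs)
    observed = abs(sum(xs) - sum(ys)) / n
    hits = 0
    for _ in range(P):
        diff = 0.0
        for x, y in zip(xs, ys):
            if rng.random() < 0.5:  # permute this pair
                x, y = y, x
            diff += x - y
        if abs(diff) / n >= observed:
            hits += 1
    return hits / P

p = perm_test([2.1, 2.4, 2.0, 2.6, 2.3], [2.0, 2.2, 1.9, 2.5, 2.2])
```

Under H0 the labels x and y are exchangeable within a pair, so swapping them simulates the sampling distribution of d.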

Permutation Test

nonparametric (no assumptions about the data)

randomized test

Example: gzip speed

1000 permutations: the difference of sample means under H0 is centered on 0

−1579 is very extreme; p ≈ 0

Comparing speed is tricky!

It is very difficult to control for everything that could affect runtime.

Solution 1: do the best you can.

Solution 2: many runs, and then do ANOVA tests (or their nonparametric equivalents).

“Is there more variance between conditions than within conditions?”

Sampling method 1

for r = 1 to 10
  for each file f
    for each program p
      time p on f

Result (gzip first): student 2's program is faster than gzip!

Result (student first): student 2's program is slower than gzip!

Sampling method 1

for r = 1 to 10
  for each file f
    for each program p
      time p on f

Order effects

Well-known in psychology: what the subject does at time t will affect what she does at time t+1.

Sampling method 2

for r = 1 to 10
  for each program p
    for each file f
      time p on f
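The scheme above groups all of a program's timings together, so successive measurements of one program share the same machine state. A Python sketch; `time_on` and the toy "programs" are hypothetical stand-ins for timing gzip and the student submissions:

```python
import time

def time_on(program, f):
    """Hypothetical stand-in: run `program` on input `f`, return elapsed seconds."""
    start = time.perf_counter()
    program(f)
    return time.perf_counter() - start

def sampling_method_2(programs, files, runs=10):
    """Outer loop over runs, then programs, then files (sampling method 2)."""
    times = {name: [] for name in programs}
    for _ in range(runs):
        for name, prog in programs.items():
            for f in files:
                times[name].append(time_on(prog, f))
    return times

times = sampling_method_2({"double": lambda x: x * 2, "square": lambda x: x * x},
                          files=list(range(5)), runs=2)
```

Swapping the two inner loops gives sampling method 1, which interleaves programs on each file and is where the order effects above showed up.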

Result

gzip wins

Sign and Permutation Tests

[Diagram: the space of all distribution pairs (F, G). Inside it, the region where F ≠ G contains the smaller region where median(F) ≠ median(G). The sign test rejects H0 on the region where the medians differ; the permutation test rejects H0 on the larger region where F ≠ G.]

There are other tests!

We have chosen two that are nonparametric and easy to implement.

Others include:
Wilcoxon Signed-Rank Test
Kruskal-Wallis (nonparametric "ANOVA")

Pre-increment?

Conventional wisdom:

“Better to use ++x than to use x++.”

Really, with a modern compiler?

Two (toy) programs

for (i = 0; i < (1 << 30); ++i)
    j = ++k;

for (i = 0; i < (1 << 30); i++)
    j = k++;

ran each 200 times (interleaved)
mean runtimes were 2.835 and 2.735
significant well below p = .05

What?

leal -8(%ebp), %eax
incl (%eax)
movl -8(%ebp), %eax

movl -8(%ebp), %eax
leal -8(%ebp), %edx
incl (%edx)

(%edx is not used anywhere else)

Conclusion

Compile with -O and the assembly code is identical!

Why was this a dumb experiment?

Pre-increment, take 2

Take gzip source code. Replace all post-increments with pre-increments, in places where semantics won't change.

Run on 1000 files, 10 times each. Compare average runtime by file.

Sign test

p = 8.5 × 10^-8

Permutation test

Conclusion

Preincrementing is faster!

... but what about -O?
sign test: p = 0.197
permutation test: p = 0.672

Preincrement matters without an optimizing compiler.

Joke.

Your programs ...

8 students had a working program both weeks.

6 people changed their code.
1 person changed nothing.
1 person changed to -O3.
3 people lossy in week 1.
Everyone lossy in week 2!

Your programs!

Was there an improvement on compression between the two versions?

H0: No.

Find the sampling distribution of the difference in means, using permutations.

Student 1 (lossless week 1)

Compression < 1?

Student 2: worse compression

Compression < 1?

Student 3

Student 4 (lossless week 1)

Student 5 (lossless week 1)

Student 6

Student 7

Student 8

Homework Assignment 2

6 experiments:
1. Does your program compress text or images better?
2. What about variance of compression?
3. What about gzip's compression?
4. Variance of gzip's compression?
5. Was there a change in the compression of your program from week 1 to week 2?
6. In the runtime?

Remainder of the course

11/9: EDA
11/16: Regression and learning
11/23: Happy Thanksgiving!
11/30: Statistical debugging
12/7: Review, Q&A
Saturday 12/17, 2-5pm: Exam