
Statistical Methods in Computer Science

The Basis for Experiment Design for Hypothesis Testing

Ido Dagan

Reminders:
1. Instructions for participating in the experiment are on the course website
2. Excel Recitations:
   Wednesday – in computer room 604/203
   Thursday – same room next week, no class this week
   ** your BIU-CS login should be active **

Empirical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan


Experimental Lifecycle

Model/Theory --> Hypothesis --> Experiment --> Analysis --> (back to Model/Theory)


Proving a Theory?

Methods of establishing a proposition:
  An experiment supports it
  We can prove it mathematically

Some propositions cannot be verified empirically:
  “This compiler has linear run-time”
  Infinitely many possible inputs --> cannot be proven empirically

But they may still be disproved:
  e.g., by code that causes the compiler to run in non-linear time


Karl Popper's Philosophy of Science

Popper advanced a particular philosophy of science: falsifiability

For a theory to be considered scientific, it must be falsifiable:
  There must be some way to refute it, in principle
  Not falsifiable <==> Not scientific

Examples:
  “All crows are black”: falsifiable by finding a white crow
  “Compiles in linear time”: falsifiable by observing non-linear performance

A theory is tested on its predictions


Proving by disproving...

Platt (“Strong Inference”, 1964) offers a specific method:
  1) Devise alternative hypotheses for the observations
  2) Devise experiment(s) allowing elimination of hypotheses
  3) Carry out the experiments to obtain a clean result
  4) Go to 1

The idea is to eliminate hypotheses by rejecting them


Forming Hypotheses

So, to support theory X, we:
  1) Construct falsification hypotheses X1, ..., Xn, ...
  2) Systematically experiment to disprove X by trying to prove each Xi
  3) If all falsification hypotheses are eliminated, this lends support to the theory

Note that future falsification hypotheses may be formed:
  The theory must continue to hold against “attacks”
  Popper: scientific evolution, “survival of the fittest theory”
  E.g., Newton's theory

How does this view hold in computer science?


Forming Hypotheses in CS

(1) Carefully identify the theoretical claim we are studying:
  e.g., “the relation between input size and run-time is linear”
  e.g., “the display improves user performance”

(2) Identify the falsification hypothesis (null hypothesis) H0:
  e.g., “there is an input size for which run-time is non-linear”
  e.g., “the display will have no effect on user performance”

(3) Now, experiment to eliminate H0
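A minimal sketch of step (3) for the run-time claim (Python; run_program below is a hypothetical stand-in for the system under test): if run-time is linear in input size, doubling the input should roughly double the time, so a ratio well above 2 at some size is evidence for H0.

```python
import random
import time

def run_program(input_data):
    # Hypothetical stand-in for the system under test (e.g., the compiler).
    return sorted(input_data)

def time_run(n, repeats=5):
    """Median wall-clock time of run_program on a random input of size n."""
    times = []
    for _ in range(repeats):
        data = [random.random() for _ in range(n)]
        start = time.perf_counter()
        run_program(data)
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

# Under linearity, doubling the input size should roughly double the run-time.
# A ratio well above 2 at some size supports H0 ("non-linear for some input size").
sizes = [1000, 2000, 4000, 8000, 16000]
measurements = [(n, time_run(n)) for n in sizes]
for (n1, t1), (n2, t2) in zip(measurements, measurements[1:]):
    print(f"n={n1} -> n={n2}: time ratio {t2 / t1:.2f} (about 2 expected if linear)")
```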


The Basics of Experiment Design

Experiments identify a relation between variables X, Y, ...

Simple experiments: provide an indication of a relation
  Better/worse, linear or non-linear, ...

Advanced experiments: help identify causes and interactions
  e.g., linear in input size, but the constant factor depends on the type of data


Types of Experiments and Variables

Manipulation experiments:
  Manipulate (= set the value of) independent variables (e.g., input size)
  Observe (measure the value of) dependent variables (e.g., run time)

Observation experiments:
  Observe predictor variables (e.g., a person's height)
  Observe response variables (e.g., running speed)
  Also system run time, if observing a system in actual use

Other variables:
  Endogenous: on the causal path between the independent and dependent variables
  Exogenous: other variables influencing the dependent variables


An example of observation experiment

Theory: gender affects score performance

Falsifying hypothesis: gender does not affect performance
  I.e., men and women perform the same

Cannot use manipulation experiments: we cannot control gender
Must use observation experiments


An example observation experiment
(à la “Empirical Methods in AI”, Cohen 1995)

Two observed children:
  Child 1: # siblings: 2, mother: artist, gender: male, height: 145cm, teacher's attitude, child confidence, test score: 650
  Child 2: # siblings: 3, mother: doctor, gender: female, height: 135cm, teacher's attitude, child confidence, test score: 720

[Diagram repeated across four slides, highlighting in turn the independent (predictor) variables, the dependent (response) variables, the endogenous variables, and the exogenous variables.]


Experiment Design: Introduction

Different experiment types explore different hypotheses
For instance, a very simple design: the treatment experiment
  Sometimes known as a lesion study

               V0 (independent)   V1   V2   ...  Vn (exogenous)   Dependent
  treatment    Ind1               Ex1  Ex2  ...  Exn              Dep1
  control      Not(Ind1)          Ex1  Ex2  ...  Exn              Dep2

Treatment condition: independent variable set to “with treatment”
Control condition: independent variable set to “no treatment”
The populations are “identical” in all other variables
Determines the relation between the categorical variable V0 and the dependent variable
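As a sketch of how the two conditions might be compared once the dependent variable has been measured (assuming SciPy is available; the numbers below are invented), a two-sample t-test asks whether the treatment and control means differ:

```python
from scipy import stats

# Hypothetical dependent-variable measurements (e.g., task completion time in seconds)
# under the two conditions, with all other variables kept comparable.
treatment = [12.1, 11.8, 12.5, 11.9, 12.3, 12.0, 11.7, 12.4]
control = [13.0, 12.9, 13.4, 12.7, 13.1, 13.3, 12.8, 13.2]

# Two-sample t-test; H0: the treatment has no effect on the dependent variable.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the treatment appears to affect the dependent variable")
else:
    print("Cannot reject H0 at the 0.05 level")
```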


Single-Factor Treatment Experiments

A generalization of treatment experiments
Allows comparison of several different conditions

                V0 (independent)   V1   V2   ...  Vn (exogenous)   Dependent
  treatment1    Ind1               Ex1  Ex2  ...  Exn              Dep1
  treatment2    Ind2               Ex1  Ex2  ...  Exn              Dep2
  [control      Not(Ind)           Ex1  Ex2  ...  Exn              Dep3]

Compare the performance of algorithm A to B to C ...
Control condition: optional (e.g., to establish a baseline)
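A corresponding sketch for the multi-condition case (again assuming SciPy, with invented measurements): a one-way ANOVA asks whether the algorithms differ at all in the dependent variable; pairwise follow-up comparisons would come afterwards.

```python
from scipy import stats

# Hypothetical dependent-variable measurements (e.g., run time) for three algorithms,
# with the exogenous variables held comparable across conditions.
algo_a = [10.2, 10.5, 9.9, 10.1, 10.4]
algo_b = [11.0, 11.3, 10.8, 11.1, 10.9]
algo_c = [10.3, 10.6, 10.1, 10.2, 10.5]

# One-way ANOVA; H0: all conditions have the same mean of the dependent variable.
f_stat, p_value = stats.f_oneway(algo_a, algo_b, algo_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```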


Careful!

An effect on the dependent variable may not mean what we expect

Example experiment:
  Hypothesis: a fly's ear is on its wings
  Fly with two wings: make a loud noise, observe flight
  Fly with one wing: make a loud noise, no flight
  Conclusion: a fly with only one wing cannot hear!

What's going on here?
  First, the experimenter's interpretation
  But also, a lack of sufficient falsifiability:
    There are other possible explanations for why the fly wouldn't fly: another variable (the wing) affects the dependent variable (flying)


Controlling for other factors

Often we cannot manipulate all exogenous variables
Then we need to make sure they are sampled randomly
  Randomization averages out their effect

This can be difficult:
  e.g., suppose we are trying to relate gender and math scores
  We control for the effect of # of siblings by random sampling
  But # of siblings may itself be related to gender:
    Parents continue to have children hoping for a boy (Beal 1994)
    Thus # of siblings is tied to gender
  We must separate the results based on # of siblings (see the sketch below)
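A minimal sketch of that separation (plain Python; the records are invented): the gender comparison is made within each sibling-count stratum rather than over the pooled sample.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observation records: (gender, number of siblings, test score).
records = [
    ("F", 1, 720), ("M", 1, 680), ("F", 2, 700), ("M", 2, 690),
    ("F", 3, 650), ("M", 3, 655), ("F", 1, 710), ("M", 2, 685),
]

# Group scores by (siblings, gender) so gender is compared within each
# sibling-count stratum, instead of over the pooled (possibly confounded) sample.
strata = defaultdict(list)
for gender, siblings, score in records:
    strata[(siblings, gender)].append(score)

for (siblings, gender), scores in sorted(strata.items()):
    print(f"siblings={siblings} gender={gender}: mean score {mean(scores):.1f} (n={len(scores)})")
```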


Factorial Experiment Designs

• Every combination of factor values is sampled
  – The hope is to exclude or reveal interactions

• This creates a combinatorial number of experiments
  – N factors with k values each = k^N combinations (see the sketch below)

• Strategies for eliminating values:
  – Merge values or categories; skip values
  – Focus on extremes, to get a general trend
    • But this may hide behavior at intermediate values
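A minimal sketch of enumerating the full factorial design (plain Python; the factor names and values are hypothetical), which also makes the combinatorial growth explicit:

```python
from itertools import product

# Hypothetical factors and their levels: 3 factors with 3, 2, and 2 values
# give 3 * 2 * 2 = 12 conditions (k^N when every factor has k values).
factors = {
    "input_size": [1_000, 10_000, 100_000],
    "algorithm": ["A", "B"],
    "data_type": ["sorted", "random"],
}

conditions = list(product(*factors.values()))
print(f"{len(conditions)} conditions to run")
for values in conditions:
    config = dict(zip(factors.keys(), values))
    print(config)  # each config corresponds to one experiment run
```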


Tips for Factorial Experiments

For “numerical” variables, 2 value ranges are not enough
  They don't give a good sense of the function relating the variables

Measure, measure, measure
  Piggybacking measurements on planned experiments is cheaper than re-running experiments


Experiment Validity

Types of validity: internal and external validity

Internal validity:
  The experiment shows the relationship (the independent variable causes the dependent)

External validity:
  The degree to which results generalize to other conditions

Threats: uncontrolled conditions threatening validity


Internal validity threats: Examples

Order effects:
  Practice effects in human or animal test subjects
    E.g., user performance improves over successive user-interface tasks
    Solution: randomize the order of presentation to subjects
  A bug or side-effect in the testing system leaves the system “unclean” for the next trial
    Need to “clean” the system between experiments
  Treatment/control given in two different orders
    E.g., runs with/without the new algorithm, for the same users
    The order may be good for treatment and bad for control (or vice versa)
    Solution: counter-balancing (use all possible orders; see the sketch below)

Demand effects:
  The experimenter influences the subjects, e.g., by guiding them

Confounding effects: the relations between variables aren't clear
  See the “fly with one wing cannot hear” example
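A minimal sketch of the two order-related countermeasures mentioned above (plain Python; the condition and subject names are hypothetical):

```python
import random
from itertools import permutations

conditions = ["treatment", "control"]          # hypothetical conditions
subjects = [f"subject_{i}" for i in range(8)]  # hypothetical subjects

# Counter-balancing: cycle through all possible orders of the conditions,
# so each order is used for roughly the same number of subjects.
all_orders = list(permutations(conditions))
for i, subject in enumerate(subjects):
    print(subject, "counter-balanced order:", all_orders[i % len(all_orders)])

# Alternative: randomize the presentation order independently per subject.
for subject in subjects:
    print(subject, "randomized order:", random.sample(conditions, len(conditions)))
```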


External threats to validity

Outline:

Sampling bias: non-representative samples
  e.g., non-representative external factors

Floor and ceiling effects: the problems tested are too hard or too easy

Regression effects: results have no way to go but up (or down)

Solution approach: run pilot experiments


Sampling Bias

The setting prefers measuring specific values over others. For instance:
  “Random” manual selection of mice from a cage for an experiment
    favors specific values: slow, doesn’t bite (not aggressive), …
  Including only results that were found by some deadline

Solution: detect and remove the bias
  e.g., by visualization, looking for non-normal distributions
  e.g., a surprising distribution of the dependent data for different values of the independent variable (see the sketch below)
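One possible detection sketch (assuming SciPy, with invented measurements): a normality test on the dependent measurements for each value of the independent variable flags surprisingly shaped samples that deserve a closer look, e.g., with a histogram.

```python
from scipy import stats

# Hypothetical dependent measurements, grouped by the value of the independent variable.
samples = {
    "condition_A": [10.1, 10.3, 9.8, 10.2, 10.0, 9.9, 10.4, 10.1, 9.7, 10.2],
    "condition_B": [10.0, 10.1, 9.9, 10.2, 17.5, 18.1, 10.0, 9.8, 17.9, 10.1],
}

# A very small p-value hints at a non-normal (e.g., bimodal or truncated) sample,
# which can be a symptom of sampling bias and is worth inspecting visually.
for name, values in samples.items():
    w_stat, p_value = stats.shapiro(values)
    print(f"{name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.4f}")
```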


Baselines: Floor and Ceiling Effects

How do we know whether A is good or bad? Maybe the problems are too simple? Too hard?

For example:
  A new machine learning algorithm has 95% accuracy
  Is this good?

Controlling for floor/ceiling effects:
  Establish baselines
  Show what a “silly” approach achieves, and whether it comes close (see the sketch below)
  Comparison to a strawman (easy) or an ironman (hard) baseline
    May be misleading if not chosen appropriately
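A minimal sketch of the baseline idea (plain Python; the labels are invented): on a heavily skewed test set, a majority-class “silly” classifier already reaches 93%, which puts a reported 95% in a very different light.

```python
from collections import Counter

# Hypothetical test-set labels with heavy class imbalance (93% negative).
test_labels = ["neg"] * 93 + ["pos"] * 7

# "Silly" baseline: always predict the most frequent class.
majority_class, majority_count = Counter(test_labels).most_common(1)[0]
baseline_accuracy = majority_count / len(test_labels)

new_algorithm_accuracy = 0.95  # hypothetical reported accuracy of the new algorithm
print(f"majority-class baseline ({majority_class}): {baseline_accuracy:.2%}")
print(f"new algorithm: {new_algorithm_accuracy:.2%}")
```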


Regression Effects

General phenomenon: “regression towards the mean”
  Repeated measurements converge towards the mean value

Example threat:
  Run a program on 100 different inputs
  Inputs 6, 14, 15 get a very low score
  We now fix the problem that affected only these inputs, and want to re-test
  If chance has anything to do with the scoring, then we must re-run all inputs
  Why? The scores on 6, 14, 15 have nowhere to go but up, so re-running only these inputs will show improvement by chance alone (see the sketch below)

Solution: re-run the complete tests, or sample conditions uniformly
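A minimal sketch of the effect (plain Python, simulated scores): here the scores are pure chance, yet re-running only the worst-scoring inputs appears to show improvement.

```python
import random
from statistics import mean

random.seed(0)

def score(input_id):
    # Purely chance-driven score: nothing about the program actually changes.
    return random.gauss(70, 10)

inputs = list(range(100))
first_run = {i: score(i) for i in inputs}

# Take the three worst-scoring inputs and "re-test" only those.
worst = sorted(inputs, key=lambda i: first_run[i])[:3]
second_run = {i: score(i) for i in worst}

print("worst inputs, first run :", [round(first_run[i], 1) for i in worst])
print("worst inputs, second run:", [round(second_run[i], 1) for i in worst])
# The re-run scores typically drift back toward the mean (about 70),
# so the "fix" looks effective even though the scores are pure chance.
print("overall mean, first run:", round(mean(first_run.values()), 1))
```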


Summary

Defensive thinking:
  If I were trying to disprove the claim, what would I do?
  Then think of ways to counter any possible attack on the claim

Strong Inference, Popper's falsification ideas:
  Science moves by disproving theories (empirically)

Experiment design:
  Ideal independent variables: easy to manipulate
  Ideal dependent variables: measurable, sensitive, and meaningful
  Carefully think through the threats to validity