hypothesis testing. select 50% users to see headline a ◦ titanic sinks select 50% users to see...

31
Hypothesis Testing

Upload: neil-neal

Post on 05-Jan-2016

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

Hypothesis Testing

Page 2: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

2

Select 50% users to see headline A◦ Titanic Sinks

Select 50% users to see headline B◦ Ship Sinks Killing Thousands

Do people click more on headline A or B?

The New York Times Daily Dilemma

Page 3: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

3

Two Populations

Testing Hypotheses

12

?

?

109

?

?

4

?

Which one has the largest average?

?

?

Page 4: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

4

The two-sample t-test

Is difference in averages between two groups more than we would expect based on chance alone?

Page 5: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

5

More Broadly: Hypothesis Testing Procedures

Hypothesis Testing

Procedures

Parametric

Z Test t Test Cohen's d

Nonparametric

Wilcoxon Rank Sum

Test

Kruskal-Walli H-

Test

Kolmogorov-Smirnov

test

Page 6: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

6

Parametric Test Procedures

Tests Population Parameters (e.g. Mean)

Distribution Assumptions (e.g. Normal distribution)

Examples: Z Test, t-Test, 2 Test, F test

Page 7: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

7

Nonparametric Test Procedures

Not Related to Population ParametersExample: Probability Distributions, Independence

Data Values not Directly UsedUses Ordering of Data

Examples: Wilcoxon Rank Sum Test , Komogorov-Smirnov Test

Page 8: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

8

In class experiment with R

Left= c(20,5,500,15,30) Right = c(0,50,70,100)

t.test(Left, Right, alternative=c("two.sided","less","greater"), var.equal=TRUE, conf.level=0.95)

Two-sample t-Test

Page 9: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

t-Test (Independent Samples)

H0: μ1 - μ2 = 0

H1: μ1 - μ2 ≠ 0

The goal is to evaluate if the average difference between two populations is zero

The t-test makes the following assumptions

• The values in X(0) and X(1) follow a normal distribution

• Observations are independent

Two hypotheses:

9

Page 10: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

General t formula

t = sample statistic - hypothesized population parameter estimated standard error

Independent samples t

Empirical averages

Estimated standard deviation??

t-Test Calculation

10

Page 11: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

Standard deviation of difference in empirical averages

t-Test: Standard Deviation Calculation

How much variance when we use average difference of observations to represent the true average difference?

11

Page 12: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

t-Test: Standard Deviation Calculation (2/2)

Standard deviation of difference in empirical averages

with degrees of freedom

Also known as Welsh’s t

Sample variance of X(0)

Number of observationsin X(0)

12

Page 13: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

13

What is the p-value?

Can we ever accept hypothesis H1 ?

t-Statistics p-valueH0: μ1 - μ2 = 0

H1: μ1 - μ2 ≠ 0

Page 14: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

14

Effect Size

Page 15: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

t-Test tests only if the difference is zero or not.

What about effect size?

Cohen’s d

where s is the pooled variance

t-Test: Effect Size

15

Page 16: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

16

Bayesian Approach

Page 17: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

17

Probability of hypothesis given data

The Bayes factor

Bayesian Approach

Page 18: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

18

Example of Nonparametric Test

Page 19: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

19

Two-sample Kolmogorov-Smirnov Test◦ Do X(0) and X(1) come from same underlying distribution?

◦ Hypothesis (same distribution)rejected at level p if

Nonparametric Testing of Distributions

Wikipedia

Em

pir

ical

Confidence interval factor

Sample size correction

The K-S test is less sensitive when the differences between curves is greatest at the beginning or the end of the distributions. Works best when distributions differ at center.

Good reading: M. Tygert, Statistical tests for whether a given set of independent, identically distributed draws comes from a specified probability density. PNAS 2010

Page 20: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

20

Are Two User Features Independent?

Page 21: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

Twitter users can have gender and number of tweets.

We want to determine whether gender is related to number of tweets.

Use chi-square test for independence

Chi-Squared Test

21

Page 22: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

When to use chi-square test for independence:◦ Uniform sampling design◦ Categorical features◦ Population is significantly larger than sample

State the hypotheses:◦ H0 ?

◦ H1 ?

When to use Chi-Squared test

22

Page 23: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

men = c(300, 100, 40)

women = c(350, 200, 90)

data = as.data.frame(rbind(men, women))

names(data) = c('low', 'med', 'large')

data

chisq.test(data)

Reject H0 (p<0.05) means …

Example Chi-Squared Test

23

Page 24: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

24

Deciding Headlines

Page 25: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

25

Select 50% users to see headline A◦ Titanic Sinks

Select 50% users to see headline B◦ Ship Sinks Killing Thousands

Assign half the readers to headline A and half to headline B?◦ Yes?◦ No?◦ Which test to use?

What happens A is MUCH better than B?

Revisiting The New York Times Dilemma

Page 26: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

26

How to stop experiment early if hypothesis seems true◦ Stopping criteria often needs to be decided before

experiment starts◦ If ever needed:

Sequential Analysis (Sequential Hypothesis Test)

Page 27: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

27

But there is a better way…

Page 28: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

28

K distinct hypotheses (so far we had K = 2)◦ Hypothesis = choosing NYT headline

Each time we pull arm i we get reward Xi

(simple version of problem)

Underlying population (reward distribution) does not change over time

Bandit algorithms attempt to minimize regret◦ If n = total actions ; ni = total actions i

Bandit Algorithms

largest true average

Page 29: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

29

Note that regret is defined over the true average reward

How can we estimate true average reward Xi?◦ We need to get lots of observations from population i◦ But what happens if E[Xi] is small?

Core of decision making problems: ◦ Exploration vs. exploitation◦ When exploring we seek to improve estimated average reward ◦ When exploiting we try what has worked better in the past

Balancing exploration and exploitation: ◦ Instead of trying the action with highest estimated average, we

try the action with the highest upper bound on its confidence interval(more on this next class)

Challenge

Page 30: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

30

Multi-Armed Bandit (MAB)◦ Bandit process is a special type of Markov Decision

Process◦ Generally, reward Xi(ni) at ni–th arm pull of arm i is P[Xi(ni) |

Xi(ni-1)]

UCB1◦ Use arm i that maximizes

UCB1 (Upper Confidence Bound 1)

Page 31: Hypothesis Testing.  Select 50% users to see headline A ◦ Titanic Sinks  Select 50% users to see headline B ◦ Ship Sinks Killing Thousands  Do people

31

R ExamplenumT <- 2500 # number of time stepsttest <- c()# mean of population 1mean1 = 0.4# mean of population 2mean2 = 0.7# initialize observationsx1 <- c(rbinom(n=1,size=1,prob=mean1))x2 <- c(rbinom(n=1,size=1,prob=mean2))n1 = 1n2 = 1for (i in 2:numT){ # compute reward of bandit 1 reward_1 = mean(x1) + sqrt(2*log(i)/n1) # compute reward of bandit 2 reward_2 = mean(x2) + sqrt(2*log(i)/n2) # decides which arm to pull if (reward_1 > reward_2) { x1 <- c(rbinom(n=1,size=1,prob=mean1),x1) n1 = n1 + 1 } else { x2 <- c(rbinom(n=1,size=1,prob=mean2),x2) n2 = n2 + 1 } # computes the t-Test p-value of observations if ((n1 > 2) && (n2 > 2)) { d <- t.test(x1,x2)$p.value } else { d = 1 } ttest <- c(ttest, d)}par(mfrow=c(1,2))plot(2:numT,ttest,"l",xlab="Time",ylab="t-Test pvalue",log="y")barplot(c(n1,n2),xlab="Arm Pulls",ylab="Observations",names.arg=c("n0", "n1"))