Lies, Damned Lies, & Search Marketing Statistics by Adria Kyne

#SMX #12C3 @AdriaK

How to Avoid the First Two When Producing the Latter

Lies, Damned Lies, and Search Marketing Statistics

Adria Kyne, Vistaprint

Problems
• Using samples that are too small
• Using significance as a stopping point for a test

Solutions
• More rigor with fixed-sample tests
• Using sequential sampling tests
• Bayesian testing

Bonus: Pro Tip for achieving valid samples

Today’s Topics

1) Make sure that we understand what actually happened

2) Be sure that we can use these results to predict the future

What is the Whole Point of This Anyway?

1. We want to know whether the variation is better, worse, or the same as the original.

2. We don’t want to see a positive outcome that isn’t really there (a false positive, or Type I error).

3. We don’t want to miss a positive outcome that is really there (a false negative, or Type II error).

Basics of Hypothesis Testing

Your product page has an average 2.0% conversion rate (CR). You make a bunch of tweaks to the design, and after 30,000 visits, your CR is 2.25%.

You think you’re a genius, and so you tell your boss. Score!

#1 A Common (Sad) Story

At the end of the month, your revenue is no higher. You look bad.

The change you saw was not “significant,” because your sample size wasn’t big enough.

Yes, 30,000 visits was not enough.

You spoke too soon.

I gotta be cruel to be kind.

The smaller the difference, the bigger the sample you’ll need:

2.0% to 3.0% is a 50% increase

2.0% to 2.5% is a 25% increase

2.0% to 2.25% is a 12.5% increase

For standard A/B hypothesis tests

Decide on How Much Impact Your Change Should Have

Visits    CR      Orders   AOV   Revenue   Annual Increase
20,000    2.00%   400      $50   $20,000
20,000    2.25%   450      $50   $22,500   $30,000
20,000    2.50%   500      $50   $25,000   $60,000

How much of a difference do you want to be able to detect with your test?
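The arithmetic behind the revenue table can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and it assumes the 20,000 visits are monthly traffic, which is inferred from the annual figures being 12 times the monthly revenue difference.

```python
def annual_increase(visits, baseline_cr, test_cr, aov, periods_per_year=12):
    """Extra yearly revenue if the CR moves from baseline_cr to test_cr.

    Assumption: `visits` is traffic per period (monthly here), inferred
    from the annual figures in the table.
    """
    baseline_revenue = visits * baseline_cr * aov
    test_revenue = visits * test_cr * aov
    return (test_revenue - baseline_revenue) * periods_per_year


for test_cr in (0.0225, 0.025):
    extra = annual_increase(20_000, 0.02, test_cr, aov=50)
    print(f"{test_cr:.2%} CR: ${extra:,.0f} more per year")
```

Running the numbers this way before the test makes it easier to decide which uplift is worth the sample size it will cost to detect.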

“power analysis for two independent proportions”

• “power analysis”: minimum sample size
• “two independent”: we’re showing the variants to different visitors
• “proportions”: we’re comparing rates, which are proportions

Pick a Sample Size Calculator

Is the variation higher or lower than the original? That calls for a “two-tailed test.”

A 5% significance level is common: if there is really no difference, there’s a 5% chance of a false positive.

80% statistical power is common: if there is a real effect, there is a 20% chance (1 in 5) that we’d miss it.

Calculator Options

http://bit.ly/25zI5Rv

P1 = your control CR, e.g. 0.02 for 2%
P2 = your likely test CR, e.g. 0.025 for 2.5%

The effect of using 0.05 and 80% is that we are 4 times more likely to get a false negative (20%) than a false positive (5%)

We’re more concerned about making things worse

We accept a higher chance that we won’t see a positive effect that is actually there

Consequences of Significance and Power Choices

Those are arbitrary choices. We’re not testing pharmaceuticals. Are we really so terrified that we’ll roll out a page that isn’t an improvement?

NOBODY IS GOING TO DIE

Means that I love you. Baby.

Necessary Sample Sizes

1% change: 3,826

0.5% change: 13,809

0.25% change: 52,238

Detecting a 12.5% increase in Conversion Rate

Requires 52,238 Visits

For each sample
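For readers who want to check sample sizes without an online calculator, here is a minimal sketch of one standard textbook formula for comparing two independent proportions (two-tailed, 5% significance, 80% power by default). The function name is illustrative, and the linked calculator may use a slightly different formula, so treat the outputs as approximate.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate visits needed in EACH group to detect p1 vs. p2.

    Uses a common normal-approximation formula for a two-tailed test of
    two independent proportions; an assumption, not necessarily the
    exact method behind any particular online calculator.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


for test_cr in (0.03, 0.025, 0.0225):
    print(f"2.0% vs {test_cr:.2%}: {sample_size_per_group(0.02, test_cr):,} visits per group")
```

With a 2.0% baseline, the results should land very close to the sample sizes quoted in this deck. Note how halving the difference roughly quadruples the required sample.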

You’re hoping for a 0.25% uplift on a 2.0% average CR.

The Control is getting 2.0% CR, and the Variant is getting 3.0% CR!

#2 Another Common (Sad) Story

“Why haven’t we switched to the test variant? It’s CLEARLY WINNING.”

So you test the significance level.

Success! The difference is significant. You roll out the new page, and...

...nothing happens

And this is how things go awry

A significance calculation assumes that the sample size was fixed in advance

It assumes that you have a valid sample

So when you ignore this and run until you get a “significant result,” you’re misusing the math

Why didn’t it work?

If you hit a period that happens to be performing well

You may succumb to the temptation to stop while you’re ahead

Repeated significance testing increases the rate of false positives

Friends don’t let friends test significance prematurely

Why repeated significance testing is a problem

5% significance means that even if there is no difference between the test and the control, we’ll see an imaginary difference in the test 5% of the time

Remember what significance means?

Repeated Significance Testing is The Devil

Given: there is no actual difference between two test variants

One observation:

                   Option 1       Option 2
1st observation    Significant    No difference
Likelihood         5% chance      95% chance

Checking along the way, but deciding only at the end of the test:

                   Option 1       Option 2        Option 3        Option 4
1st observation    Significant    No difference   Significant     No difference
2nd observation    Significant    Significant     No difference   No difference
End of Test        Significant    Significant     No difference   No difference
Likelihood         5% chance (Significant)        95% chance (No difference)

Stopping as soon as a result looks significant:

                   Option 1       Option 2        Option 3        Option 4
1st observation    Significant    No difference   Significant     No difference
2nd observation    -              Significant     -               No difference
End of Test        Significant    Significant     Significant     No difference
Likelihood         26% chance (Significant)                       74% chance (No difference)
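The inflation caused by peeking can be seen in a small simulation. The sketch below is an illustration under stated assumptions, not the source of the percentages above: it runs A/A tests (both groups share the same true conversion rate), checks a two-sided pooled z-test at each of 10 peeks, and compares "stopped at the first significant peek" with "checked once at the end." All parameter values (10 peeks, 100 visits per peek, 10% CR, 1,000 runs) are arbitrary choices.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided test at 5% significance


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n visits in each group."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    if se == 0:
        return False
    return abs(conv_a - conv_b) / n / se > Z_CRIT


def false_positive_rates(peeks=10, visits_per_peek=100, cr=0.10, runs=1000, seed=1):
    """Return (rate if you stop at any significant peek, rate if you test once at the end)."""
    rng = random.Random(seed)
    any_peek = final_only = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        flagged = False
        significant = False
        for peek in range(1, peeks + 1):
            # Both groups have the SAME true conversion rate: no real difference.
            conv_a += sum(rng.random() < cr for _ in range(visits_per_peek))
            conv_b += sum(rng.random() < cr for _ in range(visits_per_peek))
            significant = is_significant(conv_a, conv_b, peek * visits_per_peek)
            flagged = flagged or significant
        any_peek += flagged          # would have stopped early at some peek
        final_only += significant    # result of the single end-of-test check
    return any_peek / runs, final_only / runs


peeking, single = false_positive_rates()
print(f"False positives with peeking: {peeking:.1%}")
print(f"False positives with one final test: {single:.1%}")
```

The single end-of-test check should stay near the nominal 5%, while the peeking rate comes out several times higher, even though nothing about the pages differs.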

See the slippery slope in action!

Planned CRs: Control 2.00%, Variant 2.25%

                       Day 1    Day 2    Day 3    Day 4    Day 5         Day 6    Day 7
Control conversions    150      313      448      636      750           922      1,098
Control CR             2.01%    2.10%    2.00%    2.13%    2.01%         2.06%    2.10%
Variant conversions    175      332      498      695      835           993      1,174
Variant CR             2.35%    2.23%    2.23%    2.33%    2.24%         2.22%    2.25%
Visits/Variant         7,460    14,920   22,380   29,840   37,300        44,760   52,220
Significant?           not      not      not      not      SIGNIFICANT   not      not
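The day-by-day verdicts can be re-checked with a short script. This sketch applies a two-sided pooled z-test for two proportions at the 5% level to the cumulative numbers from the table; that choice of test is an assumption, since the deck does not say which calculator produced the original verdicts.

```python
import math
from statistics import NormalDist

VISITS_PER_DAY = 7_460  # visits per variant per day, from the table
CONTROL = [150, 313, 448, 636, 750, 922, 1098]   # cumulative conversions
VARIANT = [175, 332, 498, 695, 835, 993, 1174]   # cumulative conversions


def p_value(conv_a, conv_b, n):
    """Two-sided pooled z-test for two proportions, n visits in each group."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(z))


def daily_verdicts(alpha=0.05):
    """True for each day on which the cumulative result looks significant."""
    verdicts = []
    for day, (c, v) in enumerate(zip(CONTROL, VARIANT), start=1):
        verdicts.append(p_value(c, v, day * VISITS_PER_DAY) < alpha)
    return verdicts


for day, significant in enumerate(daily_verdicts(), start=1):
    print(f"Day {day}: {'SIGNIFICANT' if significant else 'not'}")
```

Someone who checked on Day 5 and stopped would have declared a winner; someone who waited for the full sample would not.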

Smart marketers PRE-COMMIT to a valid sample size

And do not test for significance before they’ve collected it!

Therefore:

Because you have to be able to satisfy impatient observers

But I neeeeeeed to test significance repeatedly!

Solves the problem of repeated significance testing
Allows you to stop the test early if the Variant is a winner
Works with low conversion rates (under 10%)

Sequential A/B Testing

1. Determine your sample size N (number of total conversions)

2. Track the number of conversions in your Control and Variant groups

3. Check for stopping points, where Variant and Control are the conversion counts so far:
   • If Variant - Control reaches 2.25√N, the Variant wins
   • If Control - Variant reaches 2.25√N, the Control wins
   • If Variant + Control reaches N, there is no winner

Sequential experiment design

http://bit.ly/1sSDz29
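The stopping rule above is simple enough to express as a small function. This is a minimal sketch with illustrative names; it assumes the inputs are running conversion counts for each group and that N was fixed before the test started.

```python
import math


def sequential_decision(control_conversions, variant_conversions, n):
    """Apply the stopping rule above to the conversion counts seen so far.

    n is the pre-committed total number of conversions (N).
    """
    threshold = 2.25 * math.sqrt(n)
    if variant_conversions - control_conversions >= threshold:
        return "variant wins"
    if control_conversions - variant_conversions >= threshold:
        return "control wins"
    if control_conversions + variant_conversions >= n:
        return "no winner"
    return "keep going"


# Example with a hypothetical N of 400 conversions (threshold = 45).
print(sequential_decision(50, 60, 400))
print(sequential_decision(100, 150, 400))
```

Because the thresholds are set in advance, checking after every conversion does not inflate the false positive rate the way ad hoc significance checks do.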

Sequential Sampling Calculator

http://bit.ly/1TM1LKv

Given a baseline conversion rate p
Minimum detectable effect you want to see is d (as a relative change)

1.5p + d < 36%
When less than 36%, a sequential test will be shorter

p = 2.0%, d = 12.5% (2.25% CR)
1.5p + d = 1.5 × 2.0% + 12.5% = 15.5%

When to choose a fixed sample vs. a sequential test
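As a quick helper, the rule of thumb can be written as a one-line check. The function name and the second example's numbers are hypothetical; p and d are passed as fractions (0.02 for 2.0%, 0.125 for a 12.5% relative lift).

```python
def sequential_is_shorter(baseline_cr, relative_mde, cutoff=0.36):
    """Rule of thumb from the slide: sequential wins when 1.5p + d < 36%."""
    return 1.5 * baseline_cr + relative_mde < cutoff


print(sequential_is_shorter(0.02, 0.125))  # the deck's example: 15.5%
print(sequential_is_shorter(0.20, 0.10))   # hypothetical high-CR page: 40%
```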

Variant CR = better than Control
P-value = 0.18 (i.e. greater than our 0.05 significance level)

When Good Math Leads to Bad Career Moves

So how did the test go?

We stopped this morning.

So which version won?

Neither. We didn’t achieve significance.

So why did you stop it?! Just show it to another 10,000 visitors.

We can’t do that. We have to accept that the test is over.

Well, the null hypothesis... blah blah blah

Blah blah p-value blah blah blah blah

This guy is not a team player.

I am so screwed.

Communicating results is hard.

So which one performs better?

There is a 95% probability that the results we saw are not due to random chance!

Why can’t this guy just answer a straight question?

I hate my life.

How to stop your test at any time and still make valid inferences!!

Much easier to understand and explain the results!!

Bayes’ Theorem

Frequentist
• Assumes that there is no difference, and finds the probability that chance alone could have produced the experimental results seen
• Focuses on not getting Type I errors
• Most people don’t understand what the results mean

Bayesian
• Finds the probability that the test is better
• More forgiving of Type I errors
• Easier to understand and communicate to non-technical audiences

What’s the Difference?

Calculus

Why Don’t Marketers Use Bayes’ Theorem?

This formula determines the probability that B will beat A in the long run. There’s a slightly different one if you have three test groups, etc.
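The closed-form formula is not reproduced in this text. As a stand-in, the same quantity (the probability that B beats A) can be approximated by simulation. The sketch below is a swapped-in Monte Carlo technique, not the slide's formula: it assumes a uniform Beta(1, 1) prior for each conversion rate, draws from the two Beta posteriors, and counts how often B's draw is higher. The function name and the example traffic numbers are hypothetical.

```python
import random


def prob_b_beats_a(conv_a, visits_a, conv_b, visits_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(CR of B > CR of A).

    Assumption: uniform Beta(1, 1) priors, so each posterior is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + visits_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + visits_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical data: 100 vs. 130 conversions from 5,000 visits each.
print(f"P(B beats A) = {prob_b_beats_a(100, 5_000, 130, 5_000):.1%}")
```

No calculus is needed this way, at the cost of a small amount of simulation noise in the estimate.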

Online calculators are your friends!

But Wait!

Wins and losses data
Graph
• Probability distributions
Table
• Probability of being best
• Spread of conversion rates

Cool online Bayesian calculator

http://bit.ly/24mKJaY

1. Decide on the probability you’re comfortable with

2. Decide how much variance you’re willing to accept

How to use this calculator

96% probability that B is better
But what’s the real CR?
Needs more data

High spread, less overlap

Not very much CR variance
But B is only 70% likely to be better

Low spread, high overlap

Variance of CR isn’t as bad
Separation of peaks means that the CRs are different
94% probability that B is better
We aren’t certain about the actual CR

Less spread, less overlap

Sample size is only 100 conversions each!

You might actually see

Allows you to start the test with some assumptions, called “priors”

Can include:
• The prior success probability (our belief about the average conversion rate)
• How much variance you expect

Bayesian’s interesting twist
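One common way to encode such priors, shown here only as a sketch, is a Beta distribution described by an expected conversion rate and a "pseudo-visit" count that says how strongly you hold that belief (more pseudo-visits means less expected variance). The helper names and numbers are hypothetical, and a given online calculator may parameterize its priors differently.

```python
def beta_prior(mean_cr, pseudo_visits):
    """Turn a believed CR and a belief strength into Beta(a, b) parameters."""
    return mean_cr * pseudo_visits, (1 - mean_cr) * pseudo_visits


def posterior_mean(prior, conversions, visits):
    """Expected CR after combining the prior with observed test data."""
    a, b = prior
    return (a + conversions) / (a + b + visits)


# Belief: about 2% CR, held as strongly as 1,000 past visits would justify.
prior = beta_prior(0.02, 1_000)
# Observed: 30 conversions from 1,000 visits (a raw 3% CR).
print(f"Posterior mean CR: {posterior_mean(prior, 30, 1_000):.2%}")
```

The prior pulls a noisy early result back toward what you already believed; as real data accumulates, the data dominates.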

1. Set your “priors”
2. Input your test data
3. Get back the probability that the test variant performs better

Different cool Bayesian calculator

http://bit.ly/1Wzrtro

Actual, Understandable Results

You can make inferences from low traffic and low conversions

When someone says "What's the probability that the new page outperforms the old one?", you can give them an answer!

Advantage of Bayesian results

1. You know how not to run a fixed sample test

2. You know you can run a sequential sample test when you need ongoing information about the results

3. You know how to run a Bayesian test, where you can keep checking your progress AND explain the results easily

So now what?

Are you trying to detect a big difference, or a small difference?

Use the formula 1.5p + d
• Big difference (> 36%): use a normal fixed sample test
• Small difference (< 36%): use a sequential test

Do the people you report to get confused or unhappy when you try to explain significance and p-values to them?

Run a Bayesian test

Review: How to Design your Experiment

Tests using significance
1. Use a sample calculator
2. Run the test for the specified sample
3. Profit!

Bayesian test
1. Decide how solid you want your probability estimate to be
2. Run the test and update the data
3. Profit!

So That’s It, Then?

I’m all about the tough love.

We are not measuring consistent user groups
• Time of day
• Day of week
• Seasonality
• Sales

The Problem of Illusory Lift

Run your tests long enough to cover at least one entire traffic/conversion cycle

Monday-Sunday or equivalent full week

Account for business cycles

Daily differences in performance

Don’t run your test too long

Visitors delete their cookies and will pollute your samples

Account for user behavior

Nearly 40 percent of Internet users delete cookies from their primary computers on at least a monthly basis (JupiterResearch, 2005)

53 percent delete cookies, cache or browsing history to help protect their privacy online (TRUSTe/National Cyber Security Alliance U.S. Consumer Privacy Index, January 2016)

It’s probably more than you think

• Pre-commit to a sample size/experimental design
• Fixed Sample A/B testing: no peeking before it’s done
• Sequential A/B testing: built-in peeking
• Bayesian: easier to understand the results
• Collect samples for a full business cycle, but not too long

Summary

Fixed sample calculator: Stats Dept., U of British Columbia, http://bit.ly/25zI5Rv
Sequential sampling calculator: Evan Miller, http://bit.ly/1TM1LKv
Simple Bayesian calculator: Peak Conversion, http://bit.ly/24mKJaY
Bayesian calculator with priors: Lyst, http://bit.ly/1Wzrtro

Calculators I used
