Lies, Damned Lies, & Search Marketing Statistics, by Adria Kyne
TRANSCRIPT
#SMX #12C3 @AdriaK
How to Avoid the First Two When Producing the Latter
Lies, Damned Lies, and Search Marketing Statistics
Adria Kyne, Vistaprint
Problems
• Using samples that are too small
• Using significance as a stopping point for a test
Solutions
• More rigor with fixed-sample tests
• Using sequential sampling tests
• Bayesian testing
Bonus: Pro Tip for achieving valid samples
Today’s Topics
1) Make sure that we understand what actually happened
2) Be sure that we can use these results to predict the future
What is the Whole Point of This Anyway?
1. We want to know whether the variation is better, worse, or the same as the original.
2. We don’t want to see a positive outcome that isn’t really there: a false positive, or Type I error.
3. We don’t want to miss a positive outcome that really is there: a false negative, or Type II error.
Basics of Hypothesis Testing
Your product page has an average 2.0% CR. You make a bunch of tweaks to the design, and after 30,000 visits, your CR is 2.25%.
You think you’re a genius, and so you tell your boss. Score!
#1 A Common (Sad) Story
At the end of the month, your revenue is no higher.
You look bad.
The change you saw was not “significant,” because your sample size wasn’t big enough.
Yes, 30,000 visits was not enough.
You spoke too soon.
I gotta be cruel to be kind.
The smaller the difference, the bigger the sample you’ll need:
2% - 3% is a 50% increase
2.0%-2.5% is a 25% increase
2.0% - 2.25% is a 12.5% increase
For standard A/B hypothesis tests
Decide on How Much Impact Your Change Should Have
Visits   CR      Orders   AOV   Revenue   Annual Increase
20,000   2.00%   400      $50   $20,000   -
20,000   2.25%   450      $50   $22,500   $30,000
20,000   2.50%   500      $50   $25,000   $60,000
How much of a difference do you want to be able to detect with your test?
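The arithmetic behind this table can be sketched in a few lines. This is an illustration rather than the speaker's own spreadsheet: it assumes the 20,000 visits are a monthly figure (the annual increase in the table is twelve times the revenue difference, which fits that reading), and the function name is made up for the example.

```python
def revenue_impact(visits, cr, aov, baseline_cr):
    """Return (orders, revenue, annual_increase) for one row of the table.

    Assumption: `visits` is a monthly figure, so the annual increase is
    12 x the monthly revenue gain over the baseline conversion rate.
    """
    orders = round(visits * cr)
    revenue = orders * aov
    baseline_revenue = round(visits * baseline_cr) * aov
    annual_increase = 12 * (revenue - baseline_revenue)
    return orders, revenue, annual_increase


if __name__ == "__main__":
    for cr in (0.02, 0.0225, 0.025):
        orders, revenue, annual = revenue_impact(20_000, cr, 50, baseline_cr=0.02)
        print(f"CR {cr:.2%}: {orders} orders, ${revenue:,} revenue, "
              f"${annual:,} annual increase")
```

Putting a dollar figure on each candidate lift makes it easier to decide which difference is worth the traffic needed to detect it.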
“power analysis for two independent proportions”
Pick a Sample Size Calculator
minimum sample size
we’re showing the variants to different visitors
we’re comparing rates, which are proportions
Is the variation higher or lower than the original? We want to detect either, so choose a “two-tailed test.”
A 5% significance level is common: if there is no real difference, there’s a 5% chance of a false positive.
80% statistical power is common: if there is a real effect, there’s a 20% chance (1 in 5) that we’d miss it.
Calculator Options
http://bit.ly/25zI5Rv
P1 = your control CR, e.g. 0.02 for 2%
P2 = your likely test CR, e.g. 0.025 for 2.5%
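For anyone who would rather script the calculation than use a web form, here is a minimal sketch of a power analysis for two independent proportions. It uses the textbook normal-approximation formula, which is an assumption on my part: the linked calculator may round or correct differently, so treat the output as a ballpark. The function and parameter names are invented for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Visits needed in EACH group for a two-tailed test of two proportions.

    p1 = control conversion rate, p2 = expected test conversion rate.
    Normal-approximation formula; online calculators may differ slightly.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed significance
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


if __name__ == "__main__":
    for p2 in (0.03, 0.025, 0.0225):
        n = sample_size_two_proportions(0.02, p2)
        print(f"2.0% -> {p2:.2%}: about {n:,} visits per group")
```

The required sample grows roughly with the inverse square of the difference, which is why halving the lift you want to detect roughly quadruples the traffic you need.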
The effect of using 0.05 and 80% is that we are 4 times more likely to get a false negative than a false positive
We’re more concerned about making things worse
We accept a higher chance that we won’t see a positive effect that is actually there
Consequences of Significance and Power Choices
Those are arbitrary choices.
We’re not testing pharmaceuticals.
Are we really so terrified that we’ll roll out a page that isn’t an improvement?
NOBODY IS GOING TO DIE
Means that I love you.Baby.
Necessary Sample Sizes
1% change      3,826
0.5% change    13,809
0.25% change   52,238
Requires 52,238 Visits
Detecting a 12.5% increase in Conversion Rate
For each sample
Photo by Marilynn Windust https://ronmitchelladventure.com
You’re hoping for a 0.25% uplift on a 2.0% average CR.
The Control is getting 2.0% CR, and the Variant is getting 3.0% CR!
#2 Another Common (Sad) Story
“Why haven’t we switched to the test variant? It’s CLEARLY WINNING.”
So you test the significance level.
Success! The difference is significant. You roll out the new page, and...
...nothing happens
And this is how things go awry
A significance calculation assumes that the sample size was fixed in advance
It assumes that you have a valid sample
So when you ignore this and run until you get a “significant result,” you’re misusing the math
Why didn’t it work?
If you hit a period that happens to be performing well
You may succumb to the temptation to stop while you’re ahead
Repeated significance testing increases the rate of false positives
Friends don’t let friends test significance prematurely
Image: Public Domain, via Wikipedia
Why repeated significance testing is a problem
5% significance means that even if there is no difference between the test and the control
We’ll see an imaginary difference in the test 5% of the time
Remember what significance means?
Repeated Significance Testing is The Devil
Given: there is no actual difference between two test variants

One observation, at the end of the test:
                  Option 1      Option 2
1st observation   Significant   No difference
Likelihood        5% chance     95% chance

Two observations, acting only on the result at the end of the test:
                  Option 1      Option 2        Option 3        Option 4
1st observation   Significant   No difference   Significant     No difference
2nd observation   Significant   Significant     No difference   No difference
End of Test       Significant   Significant     No difference   No difference
Likelihood: 5% chance of ending Significant (Options 1-2), 95% chance of ending No difference (Options 3-4)

Two observations, stopping as soon as a result is significant:
                  Option 1      Option 2        Option 3        Option 4
1st observation   Significant   No difference   Significant     No difference
2nd observation   -             Significant     -               No difference
End of Test       Significant   Significant     Significant     No difference
Likelihood: 26% chance of ending Significant (Options 1-3), 74% chance of ending No difference (Option 4)
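One way to see the inflation for yourself is to simulate an A/A test, where both groups share the same true conversion rate, and compare testing once at the end with stopping at the first significant look. The sketch below is illustrative only: the 20% conversion rate, 200 visits per look, and five looks are assumptions chosen to keep the simulation fast, not figures from the talk, and the exact inflation you see will depend on how many times you peek.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-tailed pooled z-test for two proportions, n visits per group."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False
    z = (conv_b - conv_a) / n / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def simulate_peeking(true_cr=0.2, visits_per_look=200, looks=5, runs=500, seed=1):
    """Simulate A/A tests (no real difference between the groups).

    Returns (false positive rate when testing once at the end,
             false positive rate when stopping at the first significant look).
    """
    rng = random.Random(seed)
    final_hits = peeking_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        seen_significant = False
        significant_now = False
        for look in range(1, looks + 1):
            conv_a += sum(rng.random() < true_cr for _ in range(visits_per_look))
            conv_b += sum(rng.random() < true_cr for _ in range(visits_per_look))
            significant_now = is_significant(conv_a, conv_b, look * visits_per_look)
            seen_significant = seen_significant or significant_now
        final_hits += significant_now      # result of the single final test
        peeking_hits += seen_significant   # any look would have stopped the test
    return final_hits / runs, peeking_hits / runs


if __name__ == "__main__":
    final_rate, peeking_rate = simulate_peeking()
    print(f"Test once at the end:      {final_rate:.1%} false positives")
    print(f"Stop at first significant: {peeking_rate:.1%} false positives")
```

Because the final look is also one of the peeks, the peeking rate can never be lower than the test-once rate; every extra look only adds more chances to be fooled.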
See the slippery slope in action!
                           Day 1    Day 2    Day 3    Day 4    Day 5         Day 6    Day 7
Control (2.00%), orders    150      313      448      636      750           922      1098
Control CR                 2.01%    2.10%    2.00%    2.13%    2.01%         2.06%    2.10%
Variant (2.25%), orders    175      332      498      695      835           993      1174
Variant CR                 2.35%    2.23%    2.23%    2.33%    2.24%         2.22%    2.25%
Visits/Variant             7,460    14,920   22,380   29,840   37,300        44,760   52,220
Significant?               not      not      not      not      SIGNIFICANT   not      not
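The day-by-day verdicts above can be reproduced with an ordinary two-proportion z-test on the cumulative counts. This is a sketch under the assumption that a pooled, two-tailed z-test at the 5% level is a fair stand-in for whatever significance calculator produced the slide; the data are the cumulative figures shown above.

```python
from math import sqrt
from statistics import NormalDist

# Cumulative figures from the table, one entry per day.
visits = [7460, 14920, 22380, 29840, 37300, 44760, 52220]
control = [150, 313, 448, 636, 750, 922, 1098]
variant = [175, 332, 498, 695, 835, 993, 1174]


def two_tailed_p_value(conv_a, conv_b, n):
    """Pooled z-test for two proportions with n visits in each group."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_values = [two_tailed_p_value(c, v, n) for c, v, n in zip(control, variant, visits)]
significant_days = [day for day, p in enumerate(p_values, start=1) if p < 0.05]

if __name__ == "__main__":
    for day, p in enumerate(p_values, start=1):
        verdict = "SIGNIFICANT" if p < 0.05 else "not"
        print(f"Day {day}: p = {p:.3f} ({verdict})")
```

On these numbers only Day 5 falls below 0.05, so anyone who checked that day and stopped would have declared a winner that the full sample does not support.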
Smart marketers PRE-COMMIT to a valid sample size
And do not test for significance before they’ve collected it!
Therefore:
Because you have to be able to satisfy impatient observers
But I neeeeeeed to test significance repeatedly!
Solves the problem of repeated significance testing
Allows you to stop the test early if the Variant is a winner
Works with low conversion rates (under 10%)
Sequential A/B Testing
Image: http://geneticsandbeyond.blogspot.com/2014/08/the-puffinss-lair-sweat-of-hippos.html
1. Determine your sample size N (number of total conversions)
2. Measure the success of your Control and Variant groups
3. Check for stopping points
   If Variant - Control = 2.25√N, the Variant wins
   If Control - Variant = 2.25√N, the Control wins
   If Variant + Control = N, there is no winner
Sequential experiment design
http://bit.ly/1sSDz29
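The three stopping rules above translate almost directly into code. The sketch below is a hypothetical helper, not taken from the linked article: it treats Variant and Control as running conversion counts, takes N from whatever sample size calculator you use, and checks "reaches or passes" the boundary (>=) because counts move in whole steps while 2.25√N usually is not a whole number.

```python
from math import sqrt


def sequential_decision(variant_conversions, control_conversions, n):
    """Apply the sequential stopping rules to the current conversion counts.

    n is the pre-committed total number of conversions for the test.
    Returns one of: "variant wins", "control wins", "no winner", "keep going".
    """
    threshold = 2.25 * sqrt(n)
    lead = variant_conversions - control_conversions
    if lead >= threshold:
        return "variant wins"
    if -lead >= threshold:
        return "control wins"
    if variant_conversions + control_conversions >= n:
        return "no winner"
    return "keep going"


if __name__ == "__main__":
    # Hypothetical counts with N = 400, so the boundary is 2.25 * 20 = 45.
    print(sequential_decision(50, 40, 400))
    print(sequential_decision(120, 70, 400))
```

You can call this after every conversion or once a day; the boundaries are designed so that checking repeatedly does not inflate the false positive rate the way repeated significance tests do.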
Sequential Sampling Calculator
http://bit.ly/1TM1LKv
Given a baseline conversion rate p
Minimum detectable effect you want to see is d
1.5p + d < 36%
When less than 36%, a sequential test will be shorter
p = 2.0%, d = 12.5% (2.25% CR)
1.5p + d = 15.5%
When to choose a fixed sample vs. a sequential test
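As a quick sanity check, the rule of thumb can be written as a tiny helper. The function name is invented for the example, both inputs are expressed in percent as on the slide, and treating a score of exactly 36 as "fixed sample" is an arbitrary choice of mine since the slide only covers the two sides of the threshold.

```python
def shorter_test_design(p_percent, d_percent):
    """Return (score, design) using the 1.5p + d rule of thumb.

    p_percent: baseline conversion rate, in percent (2.0 for 2%).
    d_percent: minimum detectable effect as a relative lift, in percent.
    """
    score = 1.5 * p_percent + d_percent
    design = "sequential" if score < 36 else "fixed sample"
    return score, design


if __name__ == "__main__":
    score, design = shorter_test_design(2.0, 12.5)
    print(f"1.5p + d = {score}% -> {design} test")
```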
Variant CR = better than Control
P-value = 0.18 (i.e. greater than our 0.05 significance level)
When Good Math Leads to Bad Career Moves
So how did the test go?
We stopped this morning.
So which version won?
Neither. We didn’t achieve significance.
So why did you stop it?!
Just show it to another 10,000 visitors.
We can’t do that. We have to accept that the test is over.
This guy is not a team player.
I am so screwed.
Well, the null hypothesis... blah blah blah
Blah blah p-value blah blah blah blah
Image: 20th Century Fox via Amazon
Communicating results is hard.
So which one performs better?
There is a 95% probability that the results we saw are not due to random chance!
Why can’t this guy just answer a straight question?
I hate my life.
Image: 20th Century Fox via Amazon
How to stop your test at any time and still make valid inferences!!
Much easier to understand and explain the results!!
Bayes’ Theorem
Image via Wikipedia
What’s the Difference?
Frequentist
• Assumes that there is no difference, and finds the probability that chance alone could have produced the experimental results seen
• Focuses on not getting Type I errors
• Most people don’t understand what the results mean
Bayesian
• Finds the probability that the test is better
• More forgiving of Type I errors
• Easier to understand and communicate to non-technical audiences
Calculus
Why Don’t Marketers Use Bayes’ Theorem?
This formula determines the probability that B will beat A in the long run. There’s a slightly different one if you have three test groups, etc.
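If the calculus is the obstacle, the same probability can be approximated by simulation. The sketch below is a stand-in for the formula on the slide, not a reproduction of it: it assumes a uniform Beta(1, 1) prior for each conversion rate, draws plausible rates for A and B from their Beta posteriors, and counts how often B comes out ahead. The function name and the example counts are made up.

```python
import random


def probability_b_beats_a(conv_a, visits_a, conv_b, visits_b,
                          draws=20_000, seed=0):
    """Monte Carlo estimate of P(B's true conversion rate > A's).

    Assumes a uniform Beta(1, 1) prior, so each posterior is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + visits_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + visits_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


if __name__ == "__main__":
    # Hypothetical data: 100 of 1,000 convert on A, 130 of 1,000 on B.
    p = probability_b_beats_a(100, 1000, 130, 1000)
    print(f"Probability that B beats A: {p:.1%}")
```

The result is a direct answer to "how likely is it that the new page is better?", which is the statement most stakeholders actually want.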
Online calculators are your friends!
But Wait!
Wins and losses data
Graph
• Probability distributions
Table
• Probability of being best
• Spread of conversion rates
Cool online Bayesian calculator
http://bit.ly/24mKJaY
1. Decide on the probability you’re comfortable with
2. Decide how much variance you’re willing to accept
How to use this calculator
96% probability that B is better
But what’s the real CR?
Needs more data
High spread, less overlap
Not very much CR variance
But B is only 70% likely to be better
Low spread, high overlap
Variance of CR isn’t as bad
Separation of peaks means that the CRs are different
94% probability that B is better
We aren’t certain about the actual CR
Less spread, less overlap
Sample size is only 100 conversions each!
You might actually see
Allows you to start the test with some assumptions, called “priors”
Can include:
• The prior success probability (our belief about the average conversion rate)
• How much variance you expect
Bayesian’s interesting twist
1. Set your “priors”
2. Input your test data
3. Get back the probability that the test variant performs better
Different cool Bayesian calculator
http://bit.ly/1Wzrtro
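To make "priors" concrete, here is one common way to turn a belief about the average conversion rate and its spread into a Beta prior, and then update it with test data. This is a sketch under my own assumptions (a Beta prior fitted by the method of moments); the linked calculator may parameterize its priors differently, and the function names and example numbers are hypothetical.

```python
def beta_prior(mean_cr, sd_cr):
    """Beta(alpha, beta) prior with the given mean and standard deviation."""
    variance = sd_cr ** 2
    if not 0 < mean_cr < 1 or variance >= mean_cr * (1 - mean_cr):
        raise ValueError("sd is too large for a Beta prior with this mean")
    strength = mean_cr * (1 - mean_cr) / variance - 1  # equals alpha + beta
    return mean_cr * strength, (1 - mean_cr) * strength


def posterior(alpha, beta, conversions, visits):
    """Update the prior with observed conversions and non-conversions."""
    return alpha + conversions, beta + (visits - conversions)


if __name__ == "__main__":
    # Hypothetical belief: CR averages 2%, give or take 0.5 points.
    alpha, beta = beta_prior(0.02, 0.005)
    post_alpha, post_beta = posterior(alpha, beta, conversions=30, visits=1000)
    print(f"Prior mean:     {alpha / (alpha + beta):.2%}")
    print(f"Posterior mean: {post_alpha / (post_alpha + post_beta):.2%}")
```

A tighter prior (smaller spread) behaves like more prior visits, so it takes more test data to move the estimate away from what you believed going in.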
Actual, Understandable Results
You can make inferences from low traffic and low conversions
When someone says "What's the probability that the new page outperforms the old one?", you can give them an answer!
Advantage of Bayesian results
1. You know how not to run a fixed sample test
2. You know you can run a sequential sample test when you need ongoing information about the results
3. You know how to run a Bayesian test, where you can keep checking your progress AND explain the results easily
So now what?
Are you trying to detect a big difference, or a small difference?
Use the formula 1.5p + d
• Big difference (> 36%): use a normal fixed sample test
• Small difference (< 36%): use a sequential test
Do the people you report to get confused or unhappy when you try to explain significance and p-values to them?
• Run a Bayesian test
Review: How to Design your Experiment
So That’s It, Then?
Tests using significance
1. Use a sample calculator
2. Run the test for the specified sample
3. Profit!
Bayesian test
1. Decide how solid you want your probability estimate to be
2. Run the test and update the data
3. Profit!
I’m all about the tough love.
We are not measuring consistent user groups
• Time of day
• Day of week
• Seasonality
• Sales
The Problem of Illusory Lift
Run your tests long enough to cover at least one entire traffic/conversion cycle
Monday-Sunday or equivalent full week
Account for business cycles
Daily differences in performance
Don’t run your test too long
Visitors delete their cookies and will pollute your samples
Account for user behavior
Nearly 40 percent of Internet users delete cookies from their primary computers on at least a monthly basis (JupiterResearch, 2005)
53 percent delete cookies, cache or browsing history to help protect their privacy online (TRUSTe/National Cyber Security Alliance U.S. Consumer Privacy Index, January 2016)
It’s probably more than you think
• Pre-commit to a sample size/experimental design
• Fixed Sample A/B testing: no peeking before it’s done
• Sequential A/B testing: built-in peeking
• Bayesian: easier to understand the results
• Collect samples for a full business cycle, but not too long
Summary
Fixed sample calculator: Stats Dept., U of British Columbia http://bit.ly/25zI5Rv
Sequential sampling calculator: Evan Miller http://bit.ly/1TM1LKv
Simple Bayesian calculator: Peak Conversion http://bit.ly/24mKJaY
Bayesian calculator with priors: Lyst http://bit.ly/1Wzrtro
Calculators I used
THANK YOU! SEE YOU AT THE NEXT #SMX