Opticon 2017: Experimenting with Stats Engine
TRANSCRIPT
Experimenting with Stats Engine
Pete Koomen, Co-founder and CTO, Optimizely
@[email protected]
Agenda
1. Why we built Stats Engine
2. How to make decisions with Stats Engine
3. How to scale your decision process
Why we built Stats Engine
The study followed 1,291 participants for 10 years.
- No exercise: 438 participants, 128 deaths (29%)
- Light exercise: 576 participants, 7 deaths (1%)
- Moderate exercise: 262 participants, 8 deaths (3%)
- Heavy exercise: 40 participants, 2 deaths (5%)
“Thank goodness a third person didn't die, or public health authorities would be banning jogging.”
– Alex Hutchinson, Runner’s World
“A/A” results
The T-test (a.k.a. “NHST”, a.k.a. “Student's t-test”)
The T-test in a nutshell:
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (the p-value)
3. If the p-value is < 5%, your results are significant.
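To make the recipe concrete, here is a minimal fixed-horizon t-test on simulated conversion data. This is an illustration only, not Optimizely's implementation; the conversion rates and sample size are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated conversion data: 1 = converted, 0 = did not convert.
# Step 1: the sample size is fixed *before* looking at any results.
n = 10_000
control = rng.binomial(1, 0.10, size=n)    # assumed true rate: 10%
variation = rng.binomial(1, 0.11, size=n)  # assumed true rate: 11%

# Step 2: two-sample t-test against the null of "no difference"
# (i.e. "what are the chances I'd see this in an A/A test?").
t_stat, p_value = stats.ttest_ind(variation, control)

# Step 3: compare against the 5% threshold.
print(f"p-value: {p_value:.4f}")
print("significant" if p_value < 0.05 else "inconclusive")
```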
1908: Data is expensive. Data is slow. Practitioners are trained.
2017: Data is cheap. Data is real-time. Practitioners are everyone.
The T-test was designed for the world of 1908.
T-Test Pitfalls:
1. Peeking
2. Multiple comparisons
1. Peeking
[Figure: a p-value tracked over time, from experiment start toward the minimum sample size. Successive peeks read “p-Value > 5%. Inconclusive.” until one peek reads “p-Value < 5%. Significant!”]
Why is this a problem?
There is a ~5% chance of seeing a false positive each time you peek.
4 peeks → ~18% chance of seeing a false positive (roughly 1 − 0.95^4 ≈ 18.5% if each peek were independent).
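You can check this inflation with a short simulation. The peek schedule and traffic numbers below are invented for illustration: we run many A/A tests, peek four times, and count how often any peek dips below 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_experiments = 2_000
peeks = [2_500, 5_000, 7_500, 10_000]  # cumulative visitors at each peek

false_positives = 0
for _ in range(n_experiments):
    # An A/A test: both arms share the same true conversion rate,
    # so any "significant" result is a false positive.
    a = rng.binomial(1, 0.10, size=peeks[-1])
    b = rng.binomial(1, 0.10, size=peeks[-1])
    # Declare a winner the moment ANY peek shows p < 0.05.
    if any(stats.ttest_ind(a[:k], b[:k]).pvalue < 0.05 for k in peeks):
        false_positives += 1

# Prints a rate well above the nominal 5%; the exact value depends on
# how correlated the successive peeks are.
print(f"false positive rate with 4 peeks: {false_positives / n_experiments:.1%}")
```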
Solution: Stats Engine uses sequential testing to compute an “always-valid” p-value.
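The KDD 2017 paper cited later in this talk describes the approach: a mixture sequential probability ratio test (mSPRT), whose running p-value stays valid no matter when or how often you peek. The sketch below is an illustrative reading of that idea under a normal approximation, not Optimizely's production code; the mixture variance `tau_sq` is an assumed tuning parameter.

```python
import numpy as np

def always_valid_p_values(x, y, tau_sq=1e-4):
    """Always-valid p-values for the difference in means between two
    streams of paired observations, via a mixture SPRT with a
    N(0, tau_sq) mixing distribution over the true difference.

    x, y: equal-length arrays of control / variation observations.
    Returns p[0..n-1], where p[i] is a valid p-value after i+1 pairs,
    regardless of when you choose to stop.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = np.arange(1, len(x) + 1)
    theta_hat = np.cumsum(y - x) / n            # running mean difference
    sigma_sq = np.var(np.concatenate([x, y]))   # plug-in variance estimate
    v_n = 2.0 * sigma_sq / n                    # Var(theta_hat) after n pairs
    # Mixture likelihood ratio against H0: no underlying difference.
    lam = np.sqrt(v_n / (v_n + tau_sq)) * np.exp(
        tau_sq * theta_hat**2 / (2.0 * v_n * (v_n + tau_sq))
    )
    # The always-valid p-value is the running minimum of 1 / Lambda.
    return np.minimum.accumulate(np.minimum(1.0, 1.0 / lam))
```

Because this p-value only ratchets downward, statistical significance (1 − p) can only rise, which matches the behavior described in the decision-making section below.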
2. Multiple Comparisons
[Comic © Randall Patrick Munroe, xkcd.com]
[Figure: a grid of variations (Control, A, B, C, D) against metrics (1-5), one comparison per cell.]
False Positive Rate = P( 10% Lift | No Real Improvement ): “How likely are my results if I assume there is no underlying difference between my variation and control?”
False Discovery Rate = P( No Real Improvement | 10% Lift ): “How likely is it that my results are a fluke?”
Solution: Stats Engine controls False Discovery Rate by becoming more conservative when more metrics and variations are added to a test.
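The talk doesn't spell out the exact correction, so as a stand-in, here is the classic Benjamini-Hochberg procedure, which shows the general mechanism of FDR control: each added comparison tightens the per-comparison bar. The metric p-values below are invented.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of "discoveries" under the Benjamini-Hochberg procedure.

    Sort the m p-values ascending and find the largest rank k with
    p_(k) <= (k / m) * fdr; the k smallest p-values are discoveries.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            k_max = rank
    return sorted(order[:k_max])

# Five metrics in one experiment. With a flat p < 0.05 rule, metrics
# 0, 1, and 3 would all "win"; under FDR control only 0 and 3 survive,
# because adding comparisons makes the procedure more conservative.
p = [0.003, 0.04, 0.20, 0.01, 0.65]
print(benjamini_hochberg(p))  # -> [0, 3]
```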
How to make decisions with Stats Engine
- When should I stop an experiment?
- Understanding resets
- How do additional variations and metrics affect my experiment?
- How do I trade off between risk and velocity?
When should I stop an experiment?
👍 Use “visitors remaining” to decide whether continuing your experiment is worth it.
Understanding resets
“Peeking at A/B Tests: Why it matters, and what to do about it” KDD 2017
👍 Statistical Significance rises whenever there is strong evidence of a difference between variation and control
👍 Statistical Significance will “reset” when there is strong evidence of an underlying change.
👍 If your point estimate is near the edge of its confidence interval, consider running the experiment longer.
[Example: a confidence interval spanning -19.3% to -2.58%, with the point estimate near the interval's edge.]
How do additional variations and metrics affect my experiment?
Stats Engine treats each metric as a “signal”.
- High-signal metrics are directly affected by the experiment.
- Low-signal metrics are affected only indirectly, or not at all.
Solution: Stats Engine controls False Discovery Rate by becoming more conservative as more low-signal metrics and variations are added to a test.
[Figure: a grid of variations (A-D) against metrics (1-8), with metrics classified as primary, secondary, or monitoring.]
👍 For maximum velocity, use “high signal” primary and secondary metrics.
👍 Track “low signal” metrics as monitoring metrics.
How do I trade off between risk and velocity?
👍 Use your Statistical Significance threshold, which caps your Max False Discovery Rate, to control risk vs. velocity: a lower threshold reaches decisions faster but tolerates more false discoveries.
How to scale your decision process
- Risk vs. Velocity for Experimentation Programs
- Getting organizational buy-in

Risk vs. Velocity for Experimentation Programs
👍 Define “risk classes” for your team’s experiments
👍 Keep low-risk experiments “low touch”
👍 Save data-science analysis resources for high-risk experiments
👍 Run high-risk experiments for one or more full conversion cycles to control for seasonality
👍 Rerun high-risk experiments
Getting organizational buy-in
👍 Decide how and when you’ll share experiment results with your organization.
👍 Write down your “decision process” and socialize it with the team.